28  Policy Evaluation and Dynamic Treatment Strategies

Status: Draft

v0.4

28.1 Learning Objectives

After reading this chapter, you will be able to:

  • Define dynamic treatment strategies (policies)
  • Evaluate policies using causal models
  • Compare different policies
  • Apply G-methods, IPTW, and TMLE to policy evaluation

28.2 Introduction

This chapter focuses on Doing in the Observable world—how can we evaluate policies and treatment strategies? A dynamic treatment strategy (policy) is a rule that maps history to treatment, and we need methods to evaluate these policies using causal models.

28.3 Dynamic Treatment Strategies

28.3.1 Definition

A dynamic treatment strategy (policy) is a rule that maps history to treatment:

\[ A_t = \pi(H_t) \]

where \(H_t = (Y_{1:t}, A_{1:t-1}, L_{1:t})\) is the observed history.

28.3.2 Examples

  • “Treat if severity > threshold”: \(A_t = \mathbb{1}(L_t > \tau)\)
  • “Stop if recovered”: \(A_t = \mathbb{1}(Y_{t-1} = 0)\)
  • “Adaptive dosing”: \(A_t = f(Y_{1:t})\)
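These rules translate directly into code: a policy is just a function of (the relevant part of) the history. A minimal sketch in Julia, where the threshold `τ` and the convention that \(Y = 1\) means recovered are illustrative assumptions:

```julia
# Each policy maps (part of) the history H_t to a treatment A_t.
τ = 0.5                                              # illustrative threshold

treat_if_severe(L_t)      = L_t > τ ? 1 : 0          # A_t = 1(L_t > τ)
stop_if_recovered(Y_prev) = Y_prev == 0 ? 1 : 0      # treat until Y_{t-1} = 1 ("recovered")
adaptive_dose(Y_hist)     = clamp(1.0 - last(Y_hist), 0.0, 1.0)  # A_t = f(Y_{1:t})

treat_if_severe(0.7)       # -> 1
stop_if_recovered(1)       # -> 0
adaptive_dose([0.4, 0.2])  # -> 0.8
```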

28.4 Evaluating Policies

28.4.1 The Policy Evaluation Problem

Question: What is the expected outcome under policy \(\pi\)?

\[ \mathbb{E}\big[Y^{do(\pi)}\big] = \text{expected outcome when each } A_t \text{ is set to } \pi(H_t) \]

Note that this is an interventional quantity, not the conditional expectation of \(Y\) among patients who happened to follow \(\pi\) in the observed data.

28.4.2 Answer Using CDM

From a fitted CDM, compute: \[ \mathbb{E}^{do(\pi)}[Y_T] = \int Y_T \, P^{do(\pi)}(Y_T \mid Y_1) \, dY_T \]

Computation:

  1. Simulate trajectories under \(do(\pi)\)
  2. Average over trajectories
  3. Account for uncertainty
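The three steps can be sketched with a toy stochastic severity model; the dynamics, parameter values, and noise scale below are illustrative stand-ins for a fitted CDM:

```julia
using Random, Statistics
Random.seed!(1)

# Toy stochastic dynamics: X advances by β X - α A X per step, plus noise.
β, α, σ = 0.2, 0.3, 0.02
policy(X) = X > 0.5 ? 1.0 : 0.0        # illustrative threshold policy

function simulate_final_outcome(policy; T = 20, X0 = 0.3)
    X = X0
    for _ in 1:T
        A = policy(X)                  # step 1: treatment set by do(π)
        X = max(X + β * X - α * A * X + σ * randn(), 0.0)
    end
    return X                           # Y_T for this trajectory
end

# Steps 2-3: average over trajectories and report Monte Carlo uncertainty.
draws = [simulate_final_outcome(policy) for _ in 1:5_000]
value = mean(draws)
mc_se = std(draws) / sqrt(length(draws))
println("estimated E[Y_T^do(π)] = ", round(value, digits = 3),
        " ± ", round(1.96 * mc_se, digits = 3))
```

With a Bayesian or bootstrapped CDM, step 3 would also propagate parameter uncertainty by repeating this loop over posterior draws.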

28.4.3 Methods for Policy Evaluation

G-computation: simulate outcomes under the policy.

  • Fit outcome and confounder models
  • Simulate trajectories under the policy
  • Average over simulated outcomes

IPTW: weight observations by the inverse probability of following the policy.

  • Fit treatment models
  • Compute weights: \(w_i = \prod_{t=1}^{T} \frac{1}{P(A_t = \pi(H_t) \mid H_t)}\)
  • Fit outcome model on the weighted data

TMLE: targeted estimation of policy effects.

  • Fit an initial outcome model
  • Fit a treatment model for the weights
  • Apply a targeted update to estimate the policy effect
  • Provides valid inference
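As a concrete illustration of the IPTW weights above, here is a sketch on simulated longitudinal data where the treatment-assignment probabilities are known; the propensity, policy, and severity dynamics are all illustrative:

```julia
using Random, Statistics
Random.seed!(1)

# Simulated longitudinal data: n subjects observed over T decision points.
n, T = 1_000, 3
propensity(L) = 0.2 + 0.6 * L            # P(A_t = 1 | L_t), known here
policy(L) = L > 0.5 ? 1 : 0              # target policy A_t = π(H_t)

L = rand(n)
weights = ones(n)                        # w_i = ∏_t 1 / P(A_t = π(H_t) | H_t)
follows = trues(n)                       # whether subject i has followed π so far
for t in 1:T
    for i in 1:n
        p1 = propensity(L[i])
        A = rand() < p1 ? 1 : 0          # observed treatment
        if A == policy(L[i])
            pA = A == 1 ? p1 : 1 - p1    # probability of the action taken
            weights[i] *= 1 / pA
        else
            follows[i] = false           # deviated from π: weight becomes 0
        end
    end
    L .= clamp.(L .+ 0.1 .* randn(n), 0, 1)
end
weights[.!follows] .= 0.0

println("share following π at all t: ", mean(follows))
println("mean weight among followers: ", round(mean(weights[follows]), digits = 2))
```

The outcome model is then fitted to (or the outcome simply averaged over) this weighted sample.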

28.4.4 Implementation: Policy Evaluation

Here’s an example evaluating different policies using G-computation:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))

@auto_using Random Distributions DifferentialEquations CairoMakie

Random.seed!(42)

# Example: Disease management policies
# Policy 1: Treat if severity > 0.6
# Policy 2: Treat if severity > 0.4
# Policy 3: Always treat

β = 0.2  # Disease progression
α = 0.3  # Treatment effectiveness

# Policy definitions
policy1 = (X, t) -> X > 0.6 ? 1.0 : 0.0  # Treat if severe
policy2 = (X, t) -> X > 0.4 ? 1.0 : 0.0  # Treat if moderate
policy3 = (X, t) -> 1.0  # Always treat

"""Disease model with policy-driven treatment: `A = policy(X, t)`."""
function disease_policy!(du, u, p, t)
    X = u[1]
    policy = p[1]
    A = policy(X, t)
    du[1] = β * X - α * A * X  # progression minus treatment effect
end

u0 = [0.3]
tspan = (0.0, 20.0)

# Evaluate each policy
policies = [("Treat if X > 0.6", policy1), 
            ("Treat if X > 0.4", policy2),
            ("Always treat", policy3)]
outcomes = Float64[]
trajectories = []

for (name, policy) in policies
    prob = ODEProblem(disease_policy!, u0, tspan, (policy,))  # Using tuple for better performance
    sol = solve(prob, Tsit5())
    push!(outcomes, sol.u[end][1])
    push!(trajectories, (name=name, sol=sol))
end

# Visualise
let
fig = Figure(size = (1200, 400))
ax1 = Axis(fig[1, 1], title = "Policy Trajectories", xlabel = "Time", ylabel = "Severity")
ax2 = Axis(fig[1, 2], title = "Policy Comparison: Final Outcomes", 
           xlabel = "Policy", ylabel = "Final Severity")

colors = [:blue, :green, :red]
for (i, traj) in enumerate(trajectories)
    lines!(ax1, traj.sol.t, [u[1] for u in traj.sol.u], 
           label = traj.name, linewidth = 2, color = colors[i])
end
axislegend(ax1)

barplot!(ax2, 1:3, outcomes, color = colors)
ax2.xticks = (1:3, [p[1] for p in policies])

    fig  # Only this gets displayed
end

println("Policy evaluation results:")
for (i, (name, _)) in enumerate(policies)
    println("  ", name, ": Final severity = ", round(outcomes[i], digits=3))
end
best_policy = argmin(outcomes)
println("\nBest policy: ", policies[best_policy][1], " (lowest final severity)")
Policy evaluation results:
  Treat if X > 0.6: Final severity = 0.603
  Treat if X > 0.4: Final severity = 0.402
  Always treat: Final severity = 0.041

Best policy: Always treat (lowest final severity)

28.4.5 Value Function

The value function of policy \(\pi\) is: \[ V^\pi = \mathbb{E}^{do(\pi)}\left[\sum_{t=1}^T \gamma^t R_t\right] \]

where \(R_t\) is the reward at time \(t\) and \(\gamma\) is the discount factor.
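Given a simulated trajectory of rewards, the discounted sum is straightforward to compute; a minimal sketch, where the reward vector and \(\gamma\) are illustrative:

```julia
# Discounted return of a single trajectory of rewards R_1, ..., R_T.
# We follow the chapter's convention of starting the discounting at γ^1.
discounted_value(R, γ) = sum(γ^t * R[t] for t in eachindex(R))

R = [1.0, 1.0, 1.0]               # illustrative reward trajectory
v = discounted_value(R, 0.9)      # 0.9 + 0.81 + 0.729 = 2.439
```

Estimating \(V^\pi\) then amounts to averaging `discounted_value` over many trajectories simulated under \(do(\pi)\).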

28.4.6 Comparing Policies

Compare expected outcomes under different policies:

\[ \mathbb{E}[Y^{do(\pi_1)}] \quad \text{vs} \quad \mathbb{E}[Y^{do(\pi_2)}] \]

This allows us to choose the best policy for a given objective.

28.5 Optimal Policy

28.5.1 Definition

The optimal policy maximises expected outcome:

\[ \pi^* = \arg\max_\pi \mathbb{E}^{do(\pi)}[Y_T] \]

28.5.2 Finding Optimal Policies

Methods (Murphy 2003; Schulam and Saria 2017):

  • Dynamic programming: value iteration, policy iteration (Bellman 1957; Puterman 2014)
  • Reinforcement learning: Q-learning, policy gradient (Sutton and Barto 2018; Bertsekas 2019)
  • Causal methods: use a CDM to evaluate candidate policies
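For a one-parameter family of threshold policies, \(\pi^*\) can be approximated by brute-force search: evaluate each candidate threshold with the same simulator and keep the best. A sketch reusing the illustrative disease dynamics from earlier in the chapter (here lower final severity is better, so we minimise):

```julia
# Euler-discretised version of the illustrative disease model dX = (β - αA)X dt,
# with a threshold policy A = 1(X > τ). β and α as in the earlier example.
β, α, dt = 0.2, 0.3, 0.1

function final_severity(τ; X0 = 0.3, T = 20.0)
    X = X0
    for _ in 1:round(Int, T / dt)
        A = X > τ ? 1.0 : 0.0        # threshold policy
        X += dt * (β * X - α * A * X)
    end
    return X
end

# Grid search: evaluate every candidate threshold, keep the minimiser.
τ_grid = 0.0:0.05:1.0
scores = [final_severity(τ) for τ in τ_grid]
τ_star = τ_grid[argmin(scores)]
println("approximately optimal threshold: τ* = ", τ_star)
```

In this toy model, any severity above the threshold is treated away, so the lowest threshold (treat everyone) wins, matching the "Always treat" result above.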

28.7 Off-Policy Evaluation

28.7.1 The Problem

Evaluate policy \(\pi\) using data collected under different policy \(\pi_0\).

28.7.2 Methods (Tennenholtz et al. 2020; Sutton and Barto 2018)

  • Importance sampling: Reweight observations by \(\frac{\pi(A \mid H)}{\pi_0(A \mid H)}\)
  • Doubly robust methods: Combine model-based and importance sampling (e.g., TMLE, augmented IPTW)
  • Causal methods: Use CDM to simulate under \(\pi\)

28.7.3 Targeted Maximum Likelihood Estimation (TMLE) for Off-Policy Evaluation

TMLE provides a robust approach to off-policy evaluation that combines the strengths of importance sampling and model-based methods (van der Laan and Rubin 2006; van der Laan and Rose 2011).

28.7.3.1 TMLE Algorithm for Off-Policy Evaluation

Goal: Estimate \(E[Y^{\pi}]\) (expected outcome under policy \(\pi\)) using data collected under policy \(\pi_0\).

Steps:

  1. Fit initial outcome model: \(Q_0(A, H) = E[Y \mid A, H]\) using data from \(\pi_0\)
  2. Fit behavior policy model: \(g_0(A \mid H) = \pi_0(A \mid H)\) (known or estimated)
  3. Targeted update: update the outcome model to target \(E[Y^{\pi}]\):
    • Compute the clever covariate: \(C(A, H) = \frac{\pi(A \mid H)}{g_0(A \mid H)}\) (written \(C\) to avoid a clash with the history \(H\))
    • Fit the fluctuation \(\epsilon\) by logistic regression: \(\text{logit}(Q_1(A, H)) = \text{logit}(Q_0(A, H)) + \epsilon C(A, H)\)
    • Update on the probability scale: \(Q_1(A, H) = \text{expit}\big(\text{logit}(Q_0(A, H)) + \hat{\epsilon}\, C(A, H)\big)\)
  4. Compute policy value: \[ \hat{E}[Y^{\pi}] = \frac{1}{n} \sum_{i=1}^n \sum_a \pi(a \mid H_i) Q_1(a, H_i) \]

Advantages:

  • Doubly robust: consistent if either the outcome model or the behavior policy model is correct
  • Semiparametric efficient: achieves the optimal asymptotic variance
  • Valid inference: provides confidence intervals
  • Handles policy mismatch: works even when \(\pi\) and \(\pi_0\) differ substantially

Limitations:

  • Requires overlap (positivity): \(\pi(a \mid H) > 0\) must imply \(\pi_0(a \mid H) > 0\)
  • More complex than simple importance sampling
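To make the recipe concrete, here is a self-contained sketch of TMLE for a single-decision policy value with a binary outcome. Everything here is illustrative: the data-generating process, the deliberately crude initial model \(Q_0\), and the small Newton solver for the fluctuation; a real analysis would use fitted models and a TMLE library:

```julia
using Random, Statistics
Random.seed!(1)

expit(x) = 1 / (1 + exp(-x))
logit(p) = log(p / (1 - p))

# Simulated data collected under a known behavior policy g0; binary outcome.
n = 5_000
Lv = rand(n)
g0(l) = 0.2 + 0.6 * l                          # P(A = 1 | L) under π0
A = [rand() < g0(l) ? 1 : 0 for l in Lv]
py(a, l) = expit(-1 + 1.5a - l)                # true outcome mechanism (unknown in practice)
Y = [rand() < py(A[i], Lv[i]) ? 1 : 0 for i in 1:n]

target(l) = l > 0.3 ? 1 : 0                    # deterministic target policy π

# Step 1: initial outcome model Q0 (deliberately crude: the overall mean,
# to illustrate that a correct g0 still rescues the estimate).
q0 = clamp(mean(Y), 1e-6, 1 - 1e-6)
Q0(a, l) = q0

# Step 2: behavior policy model - known here, so no fitting needed.
# Step 3: clever covariate C(A, H) = π(A | H) / g0(A | H), then a
# one-parameter logistic fluctuation fitted by Newton's method.
C(a, l) = (target(l) == a ? 1.0 : 0.0) / (a == 1 ? g0(l) : 1 - g0(l))
ci  = [C(A[i], Lv[i]) for i in 1:n]
off = [logit(Q0(A[i], Lv[i])) for i in 1:n]

function fit_fluctuation(ci, off, Y; iters = 25)
    ε = 0.0
    for _ in 1:iters
        μ = expit.(off .+ ε .* ci)
        grad = sum(ci .* (Y .- μ))
        hess = -sum(ci .^ 2 .* μ .* (1 .- μ))
        hess == 0 && break
        ε -= grad / hess
    end
    return ε
end
εhat = fit_fluctuation(ci, off, Y)

# Step 4: update on the probability scale and average under the target policy.
Q1(a, l) = expit(logit(Q0(a, l)) + εhat * C(a, l))
ψ_hat = mean(Q1(target(l), l) for l in Lv)
truth = mean(py(target(l), l) for l in Lv)     # oracle value, for comparison only
println("TMLE estimate: ", round(ψ_hat, digits = 3),
        "  (oracle: ", round(truth, digits = 3), ")")
```

Because \(g_0\) is correct, the targeting step corrects the badly misspecified initial model; with a correct \(Q_0\) and a misspecified \(g_0\), the plug-in term would do the rescuing instead, which is the double robustness described above.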

28.7.4 Implementation: Off-Policy Evaluation with Importance Sampling

Here’s an example of off-policy evaluation using importance sampling:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))

@auto_using Random Distributions GLM DataFrames CairoMakie

Random.seed!(42)

# Simulate data collected under behavior policy π₀
# π₀: Treat if severity > 0.5
n = 500
L = rand(Uniform(0, 1), n)  # Severity
A_behavior = Int.(L .> 0.5)  # Behavior policy: treat if L > 0.5
Y = 0.5 .* A_behavior .- 0.3 .* L .+ rand(Normal(0, 0.1), n)  # Outcome

df = DataFrame(L = L, A = A_behavior, Y = Y)

# Target policy π: Treat if severity > 0.3 (more aggressive)
π_target = (L) -> L > 0.3 ? 1 : 0
π_behavior = (L) -> L > 0.5 ? 1 : 0

# Fit behavior policy model g₀(A | L) = π₀(A | L)
# Note: π₀ is deterministic here, so this logistic fit is perfectly separated
# and the fitted probabilities are pushed towards 0 and 1.
g0_model = glm(@formula(A ~ L), df, Binomial(), LogitLink())

# Compute importance sampling weights
# w = π(A | L) / π₀(A | L)
weights = Float64[]
for i in 1:n
    L_i = L[i]
    A_i = A_behavior[i]
    
    # π₀(A | L) from fitted model
    p_behavior = predict(g0_model, DataFrame(L = [L_i]))[1]
    π0_A = A_i == 1 ? p_behavior : (1 - p_behavior)
    
    # π(A | L) from target policy
    # Target policy is deterministic: π_target(L) returns 0 or 1
    # Probability of observed action A_i under target policy
    π_target_prob = π_target(L_i)  # This is 0 or 1 (deterministic policy)
    π_A = A_i == 1 ? π_target_prob : (1 - π_target_prob)
    
    # Weight (with stabilisation to avoid extreme values)
    # Avoid division by zero
    if π0_A > 1e-10
        w = π_A / π0_A
    else
        w = 0.0  # If behavior policy assigns zero probability, weight is zero
    end
    push!(weights, w)
end

# Truncate extreme weights
weights = min.(weights, 10.0)  # Cap at 10

# Estimate policy value with the self-normalised importance sampling estimator:
# E[Y^π] ≈ Σᵢ wᵢ Yᵢ / Σᵢ wᵢ  (normalising by Σ wᵢ stabilises the estimate)
policy_value = sum(weights .* Y) / sum(weights)

# Compare to naive estimate (ignoring policy mismatch)
naive_value = mean(Y)

# Visualise
let
fig = Figure(size = (1000, 400))
ax1 = Axis(fig[1, 1], title = "Importance Sampling Weights", xlabel = "Weight", ylabel = "Density")
ax2 = Axis(fig[1, 2], title = "Policy Value Estimates", xlabel = "Method", ylabel = "Expected Outcome")

hist!(ax1, weights, bins=30, normalization=:pdf, color=(:blue, 0.5))

barplot!(ax2, [1, 2], [naive_value, policy_value], color = [:red, :blue])
ax2.xticks = ([1, 2], ["Naive\n(ignores policy)", "IS-weighted\n(adjusts for policy)"])

    fig  # Only this gets displayed
end

println("Off-policy evaluation:")
println("  Behavior policy π₀: Treat if L > 0.5")
println("  Target policy π: Treat if L > 0.3")
println("  Naive estimate (ignores policy): ", round(naive_value, digits=3))
println("  IS-weighted estimate: ", round(policy_value, digits=3))
Off-policy evaluation:
  Behavior policy π₀: Treat if L > 0.5
  Target policy π: Treat if L > 0.3
  Naive estimate (ignores policy): 0.109
  IS-weighted estimate: 0.158

28.7.5 Advantages of Causal Approach

CDMs enable:

  • Interventional simulation: simulate under any policy
  • No positivity issues: no need for policy overlap, since policies never observed in the data can still be simulated
  • Uncertainty quantification: full uncertainty propagation
  • Mechanistic understanding: understand why policies work, not just that they work

28.8 World Context

This chapter addresses Doing in the Observable world—how can we evaluate policies? Policy evaluation uses observable data to reason about what will happen under different treatment strategies, bridging the Observable world (what we observe) with the Structural world (what would happen under interventions).

28.9 Key Takeaways

  1. Dynamic treatment strategies: Policies map history to treatment
  2. Policy evaluation: Estimate expected outcomes under policies
  3. Methods: G-computation, IPTW, TMLE all apply to policy evaluation
  4. Comparing policies: Choose best policy for given objective

28.10 Further Reading