26  Model Validation with Observable Data

Status: Draft

v0.4

26.1 Learning Objectives

After reading this chapter, you will be able to:

  • Apply calibration, posterior predictive checks, and residual analysis
  • Recognise residual structure and apply out-of-domain validation
  • Validate models under interventions
  • Treat “finding what’s wrong” as a first-class modelling skill

26.2 Introduction

Model criticism is essential: all models are wrong, but some are useful (Gelman et al. 2013). This chapter shows how to find what is wrong with your model and to judge whether it is useful for your question. This is part of Seeing in the Observable world: validating what we have learned from actualised data.

26.3 Why Model Criticism?

Models are approximations. Model criticism helps us:

  1. Find failures: Where does the model break?
  2. Assess usefulness: Is the model good enough for the question?
  3. Guide improvement: What should we fix?
  4. Build trust: Can we trust the model’s predictions?

26.4 Calibration

26.4.1 What Is Calibration?

A model is calibrated if predicted probabilities match observed frequencies (Gelman et al. 2013).

Example: If the model predicts 80% probability of an event, the event should occur ~80% of the time.

26.4.2 How to Check

  • Calibration plots: Predicted vs observed probabilities (Gelman et al. 2013)
  • Reliability diagrams: Binned predictions vs observed frequencies
  • Scoring rules: Brier score, log score (Vehtari et al. 2017)

26.4.3 Implementation: Calibration Plots

We can check calibration by comparing predicted probabilities to observed frequencies:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))

@auto_using Random Distributions GLM DataFrames CairoMakie

Random.seed!(42)

# Simulate data
n = 1000
X = rand(Normal(0, 1), n)
p_true = 1 ./ (1 .+ exp.(-(0.5 .+ 0.8 .* X)))  # True probability
Y = rand.(Bernoulli.(p_true))

df = DataFrame(X = X, Y = Y)

# Fit model
model = glm(@formula(Y ~ X), df, Binomial(), LogitLink())
p_pred = predict(model, df)  # Predicted probabilities

# Bin predictions and compute observed frequencies
n_bins = 10
bins = collect(range(0, 1, length = n_bins + 1))
bin_centers = (bins[1:end-1] .+ bins[2:end]) ./ 2
observed_freq = Float64[]
predicted_mean = Float64[]

for i in 1:n_bins
    # Handle last bin to include upper bound
    if i == n_bins
        in_bin = (p_pred .>= bins[i]) .& (p_pred .<= bins[i+1])
    else
        in_bin = (p_pred .>= bins[i]) .& (p_pred .< bins[i+1])
    end
    if sum(in_bin) > 0
        push!(observed_freq, mean(Y[in_bin]))
        push!(predicted_mean, mean(p_pred[in_bin]))
    end
end

# Visualise calibration
let
    fig = Figure(size = (600, 600))
    ax = Axis(fig[1, 1], title = "Calibration Plot",
              xlabel = "Predicted Probability", ylabel = "Observed Frequency",
              aspect = 1)

    scatter!(ax, predicted_mean, observed_freq, markersize = 10, color = :blue)
    lines!(ax, [0, 1], [0, 1], color = :red, linestyle = :dash, linewidth = 2, label = "Perfect calibration")
    axislegend(ax)

    fig  # Only this gets displayed
end

# Compute Brier score
brier_score = mean((p_pred .- Y).^2)
println("Calibration assessment:")
println("  Brier score: ", round(brier_score, digits=4))
println("  (Lower is better; perfect = 0)")
Calibration assessment:
  Brier score: 0.2089
  (Lower is better; perfect = 0)
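Section 26.4.2 also lists the log score, which the implementation above omits. Here is a minimal sketch for binary predictions; the helper name `logscore` and the clipping constant are illustrative choices, not part of the chapter’s codebase:

```julia
using Statistics

# Log score (mean negative log-likelihood) for binary outcomes.
# Like the Brier score it is a strictly proper scoring rule, but it
# penalises confident wrong predictions much more heavily.
function logscore(p, y; eps = 1e-12)
    p_clip = clamp.(p, eps, 1 - eps)   # guard against log(0)
    return -mean(@. y * log(p_clip) + (1 - y) * log(1 - p_clip))
end

p_pred = [0.9, 0.8, 0.3, 0.6]
y_obs  = [1, 1, 0, 0]
println(round(logscore(p_pred, y_obs), digits = 4))   # → 0.4004
```

The clipping matters in practice: a single prediction of exactly 0 or 1 on the wrong outcome would otherwise make the log score infinite, whereas the Brier score stays bounded.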

26.5 Posterior Predictive Checks (PPCs)

26.5.1 The Idea

Compare observed data to data simulated from the fitted model (Gelman et al. 2013; Gabry et al. 2019): \[ Y^{\text{rep}} \sim p(Y^{\text{rep}} \mid Y^{\text{obs}}) = \int p(Y^{\text{rep}} \mid \theta)\, p(\theta \mid Y^{\text{obs}})\, d\theta \]

If the replicated data \(Y^{\text{rep}}\) look like the observed data \(Y^{\text{obs}}\), the model captures the relevant features of the data.

26.5.2 Test Statistics

Choose test statistics \(T(Y)\) that capture important features:

  • Means, variances: First and second moments
  • Autocorrelations: Temporal structure
  • Extrema: Rare events
  • Domain-specific: Scientific quantities of interest

26.5.3 Interpretation

  • Posterior predictive \(p\)-values: values near 0 or 1 indicate misfit
  • Visual comparison: Do simulated data look like observed data?
  • Multiple statistics: Check many aspects

26.5.4 Implementation: Posterior Predictive Checks

We can compare observed data to data simulated from the fitted model:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))

@auto_using Random Distributions GLM DataFrames CairoMakie

Random.seed!(42)

# Simulate observed data
n = 200
X = rand(Normal(0, 1), n)
Y_true = 1.0 .+ 0.5 .* X .+ rand(Normal(0, 0.3), n)

df = DataFrame(X = X, Y = Y_true)

# Fit model
model = lm(@formula(Y ~ X), df)
Y_pred = predict(model, df)
σ_residual = std(residuals(model))

# Simulate replicated data from the fitted model. This is a plug-in
# approximation to the posterior predictive: parameter uncertainty is
# ignored and only residual noise is resampled.
n_sims = 1000
Y_rep = zeros(n, n_sims)

for sim in 1:n_sims
    Y_rep[:, sim] = Y_pred .+ rand(Normal(0, σ_residual), n)
end

# Test statistics (docstrings go before the definition in Julia;
# a string literal inside the body is not a docstring)
"Test statistic: mean of observations."
test_stat_mean(Y) = mean(Y)

"Test statistic: maximum of observations."
test_stat_max(Y) = maximum(Y)

T_obs_mean = test_stat_mean(Y_true)
T_obs_max = test_stat_max(Y_true)

T_rep_mean = [test_stat_mean(Y_rep[:, sim]) for sim in 1:n_sims]
T_rep_max = [test_stat_max(Y_rep[:, sim]) for sim in 1:n_sims]

# p-values (proportion of T_rep >= T_obs)
p_value_mean = mean(T_rep_mean .>= T_obs_mean)
p_value_max = mean(T_rep_max .>= T_obs_max)

# Visualise
let
    fig = Figure(size = (1000, 400))
    ax1 = Axis(fig[1, 1], title = "PPC: Mean", xlabel = "Test Statistic", ylabel = "Density")
    ax2 = Axis(fig[1, 2], title = "PPC: Maximum", xlabel = "Test Statistic", ylabel = "Density")

    hist!(ax1, T_rep_mean, bins=30, normalization=:pdf, color=(:blue, 0.5), label="Simulated")
    vlines!(ax1, [T_obs_mean], color=:red, linewidth=2, label="Observed")
    axislegend(ax1)

    hist!(ax2, T_rep_max, bins=30, normalization=:pdf, color=(:blue, 0.5), label="Simulated")
    vlines!(ax2, [T_obs_max], color=:red, linewidth=2, label="Observed")
    axislegend(ax2)

    fig  # Only this gets displayed
end

println("Posterior predictive check:")
println("  Mean: p-value = ", round(p_value_mean, digits=3))
println("  Maximum: p-value = ", round(p_value_max, digits=3))
println("  (Extreme p-values near 0 or 1 indicate problems)")
Posterior predictive check:
  Mean: p-value = 0.499
  Maximum: p-value = 0.707
  (Extreme p-values near 0 or 1 indicate problems)
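Section 26.5.2 also suggests autocorrelation as a test statistic for temporal structure; the check above uses only the mean and the maximum. A lag-1 autocorrelation statistic fits the same pattern (the function name `test_stat_acf1` is an illustrative choice):

```julia
using Statistics

# Test statistic: lag-1 autocorrelation, to detect temporal structure that
# an i.i.d. observation model cannot reproduce.
test_stat_acf1(Y) = cor(Y[1:end-1], Y[2:end])

# A strong trend gives lag-1 autocorrelation ≈ 1; alternating values give ≈ -1
println(round(test_stat_acf1(collect(1.0:100.0)), digits = 2))        # → 1.0
println(round(test_stat_acf1(repeat([1.0, -1.0], 50)), digits = 2))   # → -1.0
```

Applied column-wise to `Y_rep`, this would yield a reference distribution of replicated autocorrelations to compare against the observed value, exactly as done for the mean and maximum above.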

26.6 Residual Analysis

26.6.1 What Are Residuals?

Residuals measure how well the model fits: \[ r_t = Y_t - \mathbb{E}[Y_t \mid X_t, \theta] \]

26.6.2 What to Check

  • Mean: Should be near zero
  • Variance: Should be constant (homoscedasticity)
  • Autocorrelation: Should be near zero (no temporal structure)
  • Distribution: Should match observation model

26.6.3 Residual Structure

Problem: Residuals show structure (trends, cycles, correlations)

Implication: Model is missing important features

Solution: Add missing components (trends, seasonality, interactions)

26.6.4 Implementation: Residual Analysis

We can check residuals for structure that indicates model problems:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))

@auto_using Random Distributions GLM DataFrames CairoMakie

Random.seed!(42)

# Simulate data with missing quadratic term (model misspecification)
n = 200
X = rand(Uniform(-2, 2), n)
Y = 1.0 .+ 0.5 .* X .+ 0.3 .* X.^2 .+ rand(Normal(0, 0.2), n)  # True model has X²

df = DataFrame(X = X, Y = Y)

# Fit misspecified model (missing X² term)
model = lm(@formula(Y ~ X), df)
Y_pred = predict(model, df)
resid = Y .- Y_pred   # avoid the name `residuals`, which GLM already exports

# Check residuals
resid_mean = mean(resid)
resid_std = std(resid)

# Note: OLS residuals are orthogonal to every included regressor by
# construction, so cor(resid, X) is ≈ 0 even though the X² term is missing.
# Missing terms show up against transformations NOT in the model (e.g. X²),
# or visually as curvature in the residuals-vs-X plot.
resid_X_corr = cor(resid, X)

# Visualise
let
    fig = Figure(size = (1200, 400))
    ax1 = Axis(fig[1, 1], title = "Residuals vs X", xlabel = "X", ylabel = "Residuals")
    ax2 = Axis(fig[1, 2], title = "Residuals vs Predicted", xlabel = "Predicted Y", ylabel = "Residuals")
    ax3 = Axis(fig[1, 3], title = "Residual Distribution", xlabel = "Residuals", ylabel = "Density")

    scatter!(ax1, X, resid, color = (:blue, 0.5))
    hlines!(ax1, [0], color = :red, linestyle = :dash, linewidth = 2)

    scatter!(ax2, Y_pred, resid, color = (:blue, 0.5))
    hlines!(ax2, [0], color = :red, linestyle = :dash, linewidth = 2)

    hist!(ax3, resid, bins=20, normalization=:pdf, color=(:blue, 0.5))
    # Overlay normal distribution
    x_norm = range(minimum(resid), maximum(resid), length=100)
    y_norm = pdf.(Normal(resid_mean, resid_std), x_norm)
    lines!(ax3, x_norm, y_norm, color=:red, linewidth=2, label="Normal")
    axislegend(ax3)

    fig  # Only this gets displayed
end

println("Residual analysis:")
println("  Mean: ", round(resid_mean, digits=4), " (exactly 0 by construction for OLS with an intercept)")
println("  Std: ", round(resid_std, digits=4))
println("  Correlation with X: ", round(resid_X_corr, digits=4), " (≈ 0 by construction)")
println("\nInterpretation:")
println("  A near-zero correlation with X does NOT rule out misspecification:")
println("  OLS residuals are orthogonal to every included regressor.")
println("  The curvature in the residuals-vs-X plot reveals the missing X² term.")
Residual analysis:
  Mean: -0.0 (exactly 0 by construction for OLS with an intercept)
  Std: 0.4089
  Correlation with X: -0.0 (≈ 0 by construction)

Interpretation:
  A near-zero correlation with X does NOT rule out misspecification:
  OLS residuals are orthogonal to every included regressor.
  The curvature in the residuals-vs-X plot reveals the missing X² term.
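Section 26.6.2 also lists autocorrelation among the checks; the implementation above skips it because the data are not a time series. For time-series residuals, the Durbin–Watson statistic is a standard check (the helper name `durbin_watson` is ours, but the formula is the standard one):

```julia
# Durbin–Watson statistic for residual autocorrelation:
# ≈ 2 for uncorrelated residuals, < 2 for positive autocorrelation,
# > 2 for negative autocorrelation.
durbin_watson(r) = sum(diff(r) .^ 2) / sum(r .^ 2)

# Positively autocorrelated residuals (a smooth trend) push DW towards 0;
# alternating residuals push it towards 4.
println(round(durbin_watson(collect(1.0:100.0)), digits = 3))      # → 0.0
println(round(durbin_watson(repeat([1.0, -1.0], 50)), digits = 3)) # → 3.96
```

For residuals from a fitted time-series model, a value well below 2 signals leftover positive autocorrelation, i.e. a missing dynamic component such as a trend or an AR term.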

26.7 Out-of-Domain Validation

26.7.1 The Problem

Models often fail when applied to new domains:

  • Different time periods: Temporal generalisation
  • Different populations: Population generalisation
  • Different conditions: Condition generalisation

26.7.2 How to Validate

  • Temporal split: Train on past, validate on future
  • Population split: Train on one population, validate on another
  • Condition split: Train on one condition, validate on another
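The splits above can be sketched with a temporal split: fit on the early part of the series, evaluate on the later part, and compare in-sample to out-of-sample error. The data-generating process below, with a slope that drifts over time, is an illustrative assumption for this sketch, not part of the chapter’s examples:

```julia
using Random, Statistics

Random.seed!(42)

# Illustrative data: the slope drifts over time, so a model fitted
# on the past degrades on the future.
n = 300
t = collect(1.0:n)
X = randn(n)
slope = 0.5 .+ 0.002 .* t                  # slowly changing relationship
Y = 1.0 .+ slope .* X .+ 0.2 .* randn(n)

# Temporal split: train on the past, validate on the future
train, test = 1:200, 201:n

# OLS with intercept, fitted on the training window only
A = [ones(length(train)) X[train]]
β = A \ Y[train]
predict_lin(idx) = β[1] .+ β[2] .* X[idx]
rmse(idx) = sqrt(mean((Y[idx] .- predict_lin(idx)).^2))

println("In-sample RMSE:     ", round(rmse(train), digits = 3))
println("Out-of-sample RMSE: ", round(rmse(test), digits = 3))  # typically larger
```

Because the relationship keeps drifting after the training window, the out-of-sample error exceeds the in-sample error; a random (non-temporal) split would hide exactly this failure mode.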

26.8 World Context

This chapter addresses Seeing in the Observable world—validating what we’ve learned from actualised data. Model validation uses observable data to test whether inferred mechanisms are consistent with observations, bridging the Observable world (what we observe) with the Structural/Dynamical worlds (what mechanisms exist).

26.9 Key Takeaways

  1. Calibration: Predicted probabilities should match observed frequencies
  2. Posterior predictive checks: Compare simulated to observed data
  3. Residual analysis: Check for structure in residuals
  4. Out-of-domain validation: Test generalisation to new domains
  5. Model criticism is essential: Find what’s wrong before using the model

26.10 Further Reading