7  Identification: When Can We Learn from Data?

Status: Draft

v0.4

7.1 Learning Objectives

After reading this chapter, you will be able to:

  • Distinguish between estimand, identification, and estimator
  • Understand graphical criteria for identification (adjustment logic)
  • Decompose total effects into direct and indirect (mediation) effects and understand identification conditions
  • Recognise the limits of “just fit a big model” in the presence of confounding
  • Apply identification theory to determine if a causal question can be answered from available data
  • Use CausalDynamics.jl to check identification and enumerate causal pathways

7.2 Introduction

Not all causal questions can be answered from observational data (Pearl 2009; Shpitser and Pearl 2006; Rothman et al. 2021). This chapter builds the core mindset: estimand → identification → estimator. In the Structural world, we ask: when can we identify perfect prehensive relations (edge structure)—the ideal causal structure—from observable data?

7.3 The Three-Step Framework

7.3.1 1. Estimand

What do we want to estimate?

Example: Average treatment effect (ATE) \[ \text{ATE} = \mathbb{E}[Y^{do(A=1)}] - \mathbb{E}[Y^{do(A=0)}] \]

In the Structural world, we seek to identify perfect causal relations—the ideal forms toward which systems tend.

7.3.2 2. Identification

Can we express the estimand in terms of observable distributions? (Pearl 2009; Shpitser and Pearl 2006)

Identification asks: Is \(P^{do(A=a)}(Y)\) expressible as a function of \(P(Y, A, X)\)?

If yes, the estimand is identified. If no, it is not identified (or only partially identified).
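For example, when a set \(X\) satisfies the backdoor criterion relative to \((A, Y)\), the interventional distribution reduces to the adjustment formula:

\[ P^{do(A=a)}(Y) = \sum_{x} P(Y \mid A = a, X = x) \, P(X = x) \]

Every term on the right-hand side is an observable quantity, so the estimand is identified.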

From a Whiteheadian perspective, these graphical criteria work because they reason about prehensive relations (edges) encoded in the graph structure (see Graph Theory and Causal Patterns). The backdoor criterion identifies which edges need to be blocked to isolate causal effects. The edge structure—representing which prehensive relations exist—determines what can be learned from observations and what requires interventions.

7.3.3 3. Estimator

How do we estimate the identified quantity from finite data?

Once identified, we can construct estimators. Common approaches include:

  • Outcome regression (g-computation): model \(\mathbb{E}[Y \mid A, X]\) and average over the confounder distribution
  • Inverse probability weighting (IPW): reweight observations by a model for treatment assignment
  • Doubly robust estimators: combine both models, notably augmented IPW (AIPW) and targeted maximum likelihood estimation (TMLE)

TMLE is particularly valuable because it:

  • Is doubly robust: Consistent if either the outcome model or the treatment model is correctly specified
  • Achieves semiparametric efficiency: Optimal variance among regular asymptotically linear estimators
  • Can incorporate machine learning: Flexible models (e.g., Super Learner (Laan et al. 2007)) for outcome and treatment
  • Provides valid inference: Confidence intervals and hypothesis tests
  • Handles complex data: High-dimensional confounders, missing data, near-positivity violations

For details on TMLE implementation, see TMLE and Doubly Robust Estimation.
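To make the estimand → identification → estimator chain concrete, here is a minimal plug-in (g-computation) estimator for a single binary confounder on simulated data. This is a sketch of the simplest estimator above, not TMLE, and all coefficients are illustrative:

```julia
using Statistics, Random

Random.seed!(42)
n = 10_000
Z = rand(n) .< 0.5                    # binary confounder
A = rand(n) .< ifelse.(Z, 0.7, 0.3)   # treatment assignment depends on Z
Y = 2.0 .* A .+ 1.5 .* Z .+ randn(n)  # outcome; true ATE = 2

# Naive contrast is confounded by Z
naive = mean(Y[A]) - mean(Y[.!A])

# Plug-in estimator: average E[Y | A = a, Z = z] over the distribution of Z
gcomp(a) = sum(mean(Y[(A .== a) .& (Z .== z)]) * mean(Z .== z) for z in (false, true))
ate = gcomp(true) - gcomp(false)

println("naive ≈ ", round(naive, digits = 2), ", adjusted ≈ ", round(ate, digits = 2))
```

With the adjustment set read off the graph, the same plug-in logic extends to several confounders; TMLE adds a targeting step on top of such an initial fit.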

7.4 Graphical Criteria: Adjustment Logic

Given a causal graph, we can determine identification using adjustment criteria (Pearl 2009):

  • Backdoor criterion: adjust for a set of variables that blocks every backdoor (confounding) path from treatment to outcome
  • Frontdoor criterion: identify the effect through a mediator when backdoor adjustment is impossible
  • Instrumental variables: exploit a variable that affects the outcome only through the treatment

These criteria rely on the graph structure established in Graph Theory and Causal Patterns.

7.4.1 Implementation: Identification with CausalDynamics.jl

The CausalDynamics.jl package provides implementations of these identification criteria. Let’s work through examples of each:

7.4.1.1 Backdoor Criterion

The backdoor criterion identifies valid adjustment sets for estimating causal effects:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))
@auto_using DAGMakie CairoMakie CausalDynamics Graphs

# Confounding example: Z → X → Y, Z → Y
# Nodes: 1=Z, 2=X, 3=Y
g = DiGraph(3)
add_edge!(g, 1, 2)  # Z → X
add_edge!(g, 1, 3)  # Z → Y
add_edge!(g, 2, 3)  # X → Y

# Find backdoor adjustment set for X → Y
adj_set = backdoor_adjustment_set(g, 2, 3)
println("Backdoor adjustment set: ", adj_set)  # Set([1]) = {Z}

# Check if backdoor adjustment is possible
println("Backdoor adjustable: ", is_backdoor_adjustable(g, 2, 3))  # true

# Visualise graph with adjustment set highlighted
let
    # Highlight adjustment set nodes (Z) in yellow, treatment (X) and outcome (Y) in lightblue
    node_colors = [:yellow, :lightblue, :lightblue]
    node_colors[1] = :yellow  # Z (confounder/adjustment set)

    fig, ax, p = dagplot(g;
        figure_size = (600, 400),
        layout_mode = :acyclic,
        node_color = node_colors,
        nlabels = ["Z (confounder)", "X (treatment)", "Y (outcome)"]
    )
    fig  # Only this gets displayed
end
Backdoor adjustment set: Set([1])
Backdoor adjustable: true

Backdoor adjustment example: adjusting for confounder Z blocks the backdoor path X ← Z → Y

7.4.1.2 Frontdoor Criterion

The frontdoor criterion uses mediators when direct adjustment isn’t possible:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))
@auto_using DAGMakie CairoMakie CausalDynamics Graphs

# Frontdoor example: U → X → M → Y, U → Y
# Nodes: 1=U, 2=X, 3=M, 4=Y
g = DiGraph(4)
add_edge!(g, 1, 2)  # U → X
add_edge!(g, 1, 4)  # U → Y
add_edge!(g, 2, 3)  # X → M
add_edge!(g, 3, 4)  # M → Y

# Check whether M is a valid frontdoor adjustment set
is_valid = frontdoor_adjustment_set(g, 2, 4, [3])
println("M is valid frontdoor adjustment: ", is_valid)

# Search for potential frontdoor mediators
mediators = find_frontdoor_mediators(g, 2, 4)
println("Frontdoor mediators: ", mediators)

# Visualise graph with frontdoor mediator highlighted
let
    # Highlight mediator (M) in yellow, others in lightblue
    node_colors = [:lightblue, :lightblue, :yellow, :lightblue]

    fig, ax, p = dagplot(g;
        figure_size = (600, 400),
        layout_mode = :acyclic,
        node_color = node_colors,
        nlabels = ["U (unobserved)", "X (treatment)", "M (mediator)", "Y (outcome)"]
    )
    fig  # Only this gets displayed
end
M is valid frontdoor adjustment: false
Frontdoor mediators: Set{Int64}[]

Frontdoor example: the effect of X on Y is transmitted through the mediator M along X → M → Y. Note that the check above returns false for this encoding; in the textbook setup U is unobserved, and the result can depend on how the implementation treats U.

7.4.1.3 Instrumental Variables

Instrumental variables provide identification when direct adjustment isn’t possible:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))
@auto_using DAGMakie CairoMakie CausalDynamics Graphs

# IV example: Z → X → Y, U → X, U → Y
# Nodes: 1=Z, 2=X, 3=Y, 4=U
g = DiGraph(4)
add_edge!(g, 1, 2)  # Z → X
add_edge!(g, 2, 3)  # X → Y
add_edge!(g, 4, 2)  # U → X
add_edge!(g, 4, 3)  # U → Y

# Find instrumental variables for X → Y
instruments = find_instruments(g, 2, 3)
println("Valid instruments: ", instruments)  # [1] = {Z}

# Visualise graph with instrument highlighted
let
    # Highlight instrument (Z) in yellow, others in lightblue
    node_colors = [:yellow, :lightblue, :lightblue, :lightblue]

    fig, ax, p = dagplot(g;
        figure_size = (600, 400),
        layout_mode = :acyclic,
        node_color = node_colors,
        nlabels = ["Z (instrument)", "X (treatment)", "Y (outcome)", "U (unobserved confounder)"]
    )
    fig  # Only this gets displayed
end
Valid instruments: [1]

Instrumental variable example: Z is a valid instrument for X → Y

7.4.1.4 Time-Varying Confounding Example

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))
@auto_using DAGMakie CairoMakie CausalDynamics Graphs

# Time-varying confounding: L_t → A_t → Y_t, L_t → Y_t
# Nodes: 1=L_t, 2=A_t, 3=Y_t
g = DiGraph(3)
add_edge!(g, 1, 2)  # L_t → A_t
add_edge!(g, 1, 3)  # L_t → Y_t
add_edge!(g, 2, 3)  # A_t → Y_t

# Find backdoor adjustment set for A_t → Y_t
adj_set = backdoor_adjustment_set(g, 2, 3)
println("Adjustment set: ", adj_set)  # Set([1]) = {L_t}

# A_t and Y_t are NOT d-separated given L_t: the direct edge A_t → Y_t remains
println("A_t ⫫ Y_t | L_t: ", CausalDynamics.d_separated(g, 2, 3, [1]))  # false

# Visualise graph with adjustment set highlighted
let
    # Highlight adjustment set (L_t) in yellow, treatment and outcome in lightblue
    node_colors = [:yellow, :lightblue, :lightblue]

    fig, ax, p = dagplot(g;
        figure_size = (600, 400),
        layout_mode = :acyclic,
        node_color = node_colors,
        nlabels = ["L_t (confounder)", "A_t (treatment)", "Y_t (outcome)"]
    )
    fig  # Only this gets displayed
end
Adjustment set: Set([1])
A_t ⫫ Y_t | L_t: false

Time-varying confounding: adjusting for L_t blocks the backdoor path A_t ← L_t → Y_t

7.4.2 Template-Based vs Symbolic Identification

The adjustment criteria above are template-based methods—they provide specific graphical patterns that guarantee identification. However, not all identification problems fit these templates.

Do-calculus provides a symbolic approach to identification (Pearl 2009; Shpitser and Pearl 2006). It consists of three rules that allow us to manipulate interventional distributions algebraically (see Do-Calculus: Rules for Interventions):

  1. Insertion/deletion of observations: When certain conditional independences hold
  2. Action/observation exchange: When interventions and observations are equivalent
  3. Insertion/deletion of actions: When interventions don’t affect certain variables

Do-calculus provides a complete (though not always efficient) method for determining identifiability: if a causal effect is identifiable, do-calculus can find an expression for it in terms of observable distributions. If do-calculus cannot find such an expression, the effect is not identifiable (Pearl 2009).

While template-based methods (backdoor, frontdoor, etc.) are often more intuitive and computationally efficient, do-calculus provides the theoretical foundation and handles cases where templates don’t apply.
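As a small worked illustration of the symbolic route, the backdoor adjustment formula can itself be derived with two applications of the rules. Conditioning on a set \(X\) that blocks all backdoor paths:

\[ P^{do(A=a)}(Y) = \sum_{x} P^{do(A=a)}(Y \mid x) \, P^{do(A=a)}(x) = \sum_{x} P(Y \mid a, x) \, P(x) \]

The second equality uses Rule 2 to exchange the action for an observation in the first factor (licensed because \(X\) blocks all backdoor paths) and Rule 3 to delete the action in the second factor (licensed because \(A\) does not affect \(X\)).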

7.5 Causal Mediation Analysis

Mediation analysis decomposes the total effect of a treatment on an outcome into direct and indirect effects transmitted through intermediate variables (mediators) (VanderWeele 2015; Pearl 2009). This section extends the identification framework to answer: How much of the effect of \(X\) on \(Y\) operates through mediator \(M\) versus directly?

7.5.1 Direct and Indirect Effects

Consider a treatment \(X\), mediator \(M\), and outcome \(Y\) with structure \(X \to M \to Y\) and possibly \(X \to Y\) directly. The total effect (TE) is:

\[ TE = \mathbb{E}[Y \mid do(X=1)] - \mathbb{E}[Y \mid do(X=0)] \]

The Natural Direct Effect (NDE) captures the effect of changing \(X\) while holding \(M\) at its natural value under control:

\[ NDE = \mathbb{E}[Y_{1, M_0}] - \mathbb{E}[Y_{0, M_0}] \]

Here \(Y_{x, M_{x'}}\) denotes the counterfactual outcome when \(X\) is set to \(x\) and \(M\) is set to its value under \(X = x'\). The Natural Indirect Effect (NIE) captures the effect of changing \(M\) from its natural value under control to its natural value under treatment, while holding \(X\) at control:

\[ NIE = \mathbb{E}[Y_{0, M_1}] - \mathbb{E}[Y_{0, M_0}] \]

These effects decompose the total effect on the difference scale:

\[ TE = NDE + NIE \]

The counterfactual definitions use nested counterfactuals \(Y_{x, M_{x'}}\): the outcome when we set \(X = x\) and \(M\) to whatever value it would have taken under \(X = x'\). Under sequential ignorability (no unmeasured confounding of the \(X\)–\(M\) relationship, and no unmeasured confounding of the \(M\)–\(Y\) relationship given \(X\)), the NDE and NIE are identified from observational data (Imai et al. 2010).
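In a linear SCM with no treatment–mediator interaction, these definitions reduce to the familiar product-of-coefficients decomposition, which a small simulation can check (the coefficients a, b, c are illustrative):

```julia
using Statistics, Random

Random.seed!(1)
n = 100_000
a, b, c = 0.8, 1.2, 0.5              # X→M, M→Y, and direct X→Y coefficients
X = rand(n) .< 0.5                   # randomised treatment
M = a .* X .+ randn(n)
Y = b .* M .+ c .* X .+ randn(n)

# In this linear model: NDE = c, NIE = a*b, TE = c + a*b
te  = mean(Y[X]) - mean(Y[.!X])      # X randomised, so a simple contrast identifies TE
nde = c
nie = a * b

println("TE (simulated) ≈ ", round(te, digits = 2), ", NDE + NIE = ", nde + nie)
```

The simulated total effect matches \(c + ab\), confirming the decomposition \(TE = NDE + NIE\) on the difference scale.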

7.5.2 Path-Specific Effects

Path-specific effects generalise mediation to arbitrary causal pathways (Avin et al. 2005). For the graph \(X \to M \to Y\) with direct edge \(X \to Y\):

  • Path through M (indirect): \(X \to M \to Y\)
  • Direct path: \(X \to Y\)

The Avin–Shpitser–Pearl framework provides graphical conditions for when path-specific effects are identifiable. A key concept is the recanting witness criterion: a variable \(W\) is a “recanting witness” for path set \(\pi\) if it lies on a path in \(\pi\) and also on a path not in \(\pi\) from \(X\) to \(Y\). When such a witness exists, path-specific effects may not be identifiable without additional assumptions.

Path-specific effects are identifiable when we can express them in terms of observable (or experimentally accessible) distributions using the do-calculus or equivalent graphical criteria.

7.5.3 Sensitivity Analysis for Mediation

A critical concern in mediation analysis is unmeasured confounding of the \(M\)–\(Y\) relationship. Even when \(X\) is randomised, unmeasured confounders of \(M\) and \(Y\) can bias estimates of the indirect effect (Imai et al. 2010; VanderWeele 2015).

Sensitivity analysis for mediation introduces parameters that quantify the strength of unmeasured confounding and examines how the NDE and NIE estimates change. This allows researchers to assess robustness: how strong would unmeasured confounding need to be to explain away the observed mediation effect?

For a comprehensive treatment of sensitivity analysis, including mediation-specific approaches, see Sensitivity and Robustness.

7.5.4 Connection to Dynamical Mediation

In dynamical systems, mediation corresponds to temporal pathways through intermediate variables. The state-space model naturally captures mediation through latent states: the transition \(X_t \to X_{t+1}\) may operate partly through an intermediate state \(M_t\), and the observation \(Y_t = h(X_t)\) reflects the mediated pathway.

Continuous-time mediation appears in ODE systems through coupling terms: the effect of one variable on another may be direct (e.g., \(\dot{Y} = f(Y, X)\)) or mediated through an intermediate variable (e.g., \(\dot{M} = g(M, X)\), \(\dot{Y} = h(Y, M)\)). The identification of direct vs indirect pathways in dynamical systems extends the static mediation framework to temporal settings.
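The purely mediated case can be sketched with a forward Euler integration of illustrative dynamics (the specific functions and rate constants are made up for the example):

```julia
# X has no direct term in dY/dt, so its effect on Y is purely mediated by M.
function simulate(x; dt = 0.01, T = 10.0)
    M, Y = 0.0, 0.0
    for _ in 1:round(Int, T / dt)
        dM = -M + x          # dM/dt = g(M, X): M relaxes toward X
        dY = -Y + M          # dY/dt = h(Y, M): Y tracks M; no direct X term
        M += dt * dM
        Y += dt * dY
    end
    return Y
end

println(simulate(0.0), " ", simulate(1.0))   # steady states: ≈ 0 and ≈ 1
```

Changing x moves Y only through M's trajectory, which is the dynamical analogue of an indirect effect.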

7.5.5 Identifying Direct vs Indirect Paths with CausalDynamics.jl

The CausalDynamics.jl package provides path enumeration functions that support mediation analysis. We can use find_directed_paths to enumerate all causal pathways from treatment to outcome, and find_backdoor_paths to identify confounding paths that must be blocked:

# Find project root and include ensure_packages.jl
project_root = let
    current = pwd()
    while !isfile(joinpath(current, "Project.toml")) && !isfile(joinpath(current, "_quarto.yml"))
        parent = dirname(current)
        parent == current && break
        current = parent
    end
    current
end
include(joinpath(project_root, "scripts", "ensure_packages.jl"))
@auto_using CausalDynamics Graphs

# Mediation structure: X → M → Y, X → Y
# Nodes: 1=X (treatment), 2=M (mediator), 3=Y (outcome)
g = DiGraph(3)
add_edge!(g, 1, 2)  # X → M
add_edge!(g, 2, 3)  # M → Y
add_edge!(g, 1, 3)  # X → Y (direct path)

# Enumerate directed (causal) paths from X to Y
directed_paths = CausalDynamics.find_directed_paths(g, 1, 3)
# Result: [[1, 3], [1, 2, 3]]
# - [1, 3]: direct path X → Y
# - [1, 2, 3]: indirect path X → M → Y

# Check for backdoor paths (confounding)
backdoor_paths = CausalDynamics.find_backdoor_paths(g, 1, 3)
# Result: [] — no backdoor paths if no confounders

# With confounder Z: Z → X, Z → M, Z → Y
g_conf = DiGraph(4)
add_edge!(g_conf, 4, 1)  # Z → X
add_edge!(g_conf, 4, 2)  # Z → M
add_edge!(g_conf, 4, 3)  # Z → Y
add_edge!(g_conf, 1, 2)  # X → M
add_edge!(g_conf, 2, 3)  # M → Y
add_edge!(g_conf, 1, 3)  # X → Y

# Backdoor paths from X to Y must be blocked for identification
backdoor_paths_conf = CausalDynamics.find_backdoor_paths(g_conf, 1, 3)
# Identifies paths through Z that require adjustment
5-element Vector{Vector{Int64}}:
 [1, 4, 1, 2, 3]
 [1, 4, 1, 3]
 [1, 4, 2, 3]
 [1, 4, 2, 1, 3]
 [1, 4, 3]

This enumeration helps structure the identification problem: each directed path corresponds to a potential pathway (direct or indirect), and backdoor paths indicate which variables must be adjusted to isolate causal effects.

7.6 The Limits of “Just Fit a Big Model”

A common mistake: “I’ll just fit a flexible model with all variables.”

Problem: Without causal structure, flexible models may:

  • Adjust for colliders (opening backdoor paths)
  • Fail to adjust for confounders
  • Produce biased estimates

Solution: Use causal structure (graph) to guide adjustment.
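Collider bias is easy to demonstrate by simulation: with \(X \to C \leftarrow Y\) and no causal effect of \(X\) on \(Y\), conditioning on the collider \(C\) manufactures a spurious association (a minimal sketch with illustrative coefficients):

```julia
using Statistics, Random

Random.seed!(7)
n = 100_000
X = randn(n)
Y = randn(n)                      # X has no causal effect on Y
C = X .+ Y .+ 0.5 .* randn(n)     # collider: both X and Y cause C

corr(u, v) = cov(u, v) / (std(u) * std(v))

r_marginal = corr(X, Y)                        # ≈ 0: X and Y independent
stratum = C .> 1.0                             # "adjusting" by restricting on C
r_conditional = corr(X[stratum], Y[stratum])   # spurious negative association

println("marginal r ≈ ", round(r_marginal, digits = 2),
        ", within-stratum r ≈ ", round(r_conditional, digits = 2))
```

A flexible model that "adjusts for everything", including C, would report this spurious association as if it were signal.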

7.7 Threats to Validity and Bias

Identification theory helps us understand when causal effects can be learned from data. However, even when effects are identified in principle, bias can prevent valid inference in practice (Rothman et al. 2021). Understanding different types of bias is essential for designing studies and interpreting results.

7.7.1 Types of Bias

Epidemiological research distinguishes three main types of bias (Rothman et al. 2021):

  1. Confounding: A confounder is a variable that causes both treatment and outcome, creating a spurious association. This is the focus of identification theory—we use causal graphs to identify confounders and adjust for them.

  2. Selection bias: Occurs when the selection of subjects into the study depends on both treatment and outcome. For example, if only survivors are included in a study, the association between treatment and outcome may be biased.

  3. Information bias (measurement error): Occurs when variables are measured with error, or when measurement error differs between treatment groups. This includes misclassification of exposure, outcome, or confounders.

7.7.2 Confounding in Detail

Confounding is a special concern because it creates spurious associations that can be mistaken for causal effects (Pearl 2009; Rothman et al. 2021). A confounder \(L\) must satisfy three conditions:

  1. \(L\) is associated with treatment \(A\)
  2. \(L\) is associated with outcome \(Y\) (conditional on treatment)
  3. \(L\) is not on the causal pathway from \(A\) to \(Y\)

The backdoor criterion (see Graph Theory and Causal Patterns) provides a graphical method to identify which variables must be adjusted for to eliminate confounding (Pearl 2009). However, in practice we must also consider unmeasured confounders, which cannot be adjusted for directly and instead call for sensitivity analysis.

7.7.3 Sensitivity Analysis for Unmeasured Confounding

When unmeasured confounding is suspected, sensitivity analysis quantifies how robust results are to assumptions about unmeasured confounders (Rothman et al. 2021). Common approaches:

  1. E-value: The minimum strength of association that an unmeasured confounder would need to have with both treatment and outcome to explain away the observed effect (VanderWeele and Ding 2017)

  2. Sensitivity parameters: Specify the strength of unmeasured confounding and compute how results change

  3. Bounds: Compute worst-case bounds on the causal effect under assumptions about unmeasured confounding

Example: If an observed treatment effect has an E-value of 2.5, this means an unmeasured confounder would need to have an odds ratio of at least 2.5 with both treatment and outcome to explain away the effect. This helps assess the plausibility of unmeasured confounding (VanderWeele and Ding 2017).
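For a risk ratio \(RR \geq 1\), the E-value has the closed form \(E = RR + \sqrt{RR \, (RR - 1)}\) (VanderWeele and Ding 2017), which is straightforward to compute:

```julia
# E-value for an observed risk ratio (VanderWeele & Ding 2017).
# For RR < 1, the convention is to apply the formula to 1/RR.
evalue(rr) = rr >= 1 ? rr + sqrt(rr * (rr - 1)) : evalue(1 / rr)

println(evalue(1.8))   # ≈ 3.0
println(evalue(0.5))   # ≈ 3.41 (RR < 1 handled via 1/RR)
```

An observed RR of 1.8 thus requires a confounder associated with both treatment and outcome at RR ≈ 3 to be fully explained away.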

7.8 Partial Identification

When full identification is impossible, we may still obtain bounds:

  • Non-parametric bounds: Range of possible values
  • Sensitivity analysis: How results change with assumptions
  • Robustness: Worst-case scenarios
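For a binary outcome, the no-assumption (Manski-style) worst-case bounds on the ATE can be computed directly from three observed proportions; a sketch (the function name is illustrative):

```julia
# No-assumption bounds on the ATE for a binary outcome (Manski-style).
# p1 = P(Y=1 | A=1), p0 = P(Y=1 | A=0), pa = P(A=1)
function ate_bounds(p1, p0, pa)
    ey1_lo, ey1_hi = p1 * pa, p1 * pa + (1 - pa)        # bounds on E[Y(1)]
    ey0_lo, ey0_hi = p0 * (1 - pa), p0 * (1 - pa) + pa  # bounds on E[Y(0)]
    (ey1_lo - ey0_hi, ey1_hi - ey0_lo)
end

lo, hi = ate_bounds(0.6, 0.4, 0.5)
println((lo, hi))      # the interval always has width exactly 1
```

The width-one interval makes the point sharply: without assumptions, the data alone never rule out a zero effect, which is why the adjustment criteria above matter.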

7.9 World Context

This chapter addresses Doing in the Structural world—how can we determine when perfect causal structure can be identified from data? Identification is a centrifugal bridge concept (Structural → Observable): it applies structural principles (graph structure, do-calculus) to determine what can be learned from observable data. Identification theory provides the bridge from perfect forms (Structural) to what can be learned from observations (Observable world).

7.10 Key Takeaways

  1. Always separate estimand, identification, and estimator
  2. Graphical criteria provide systematic methods for identification
  3. Template-based methods (backdoor, frontdoor, IV) and symbolic methods (do-calculus) complement each other
  4. Causal mediation analysis decomposes total effects into direct and indirect (path-specific) effects; identification requires sequential ignorability or path-specific graphical criteria
  5. Flexible models are not a substitute for causal structure
  6. Partial identification and bounds are valuable when full identification is impossible
  7. Identification connects perfect Structural forms to observable data
  8. Bias (confounding, selection bias, information bias) can prevent valid inference even when effects are identified
  9. Sensitivity analysis helps assess robustness to unmeasured confounding (including mediation-specific sensitivity)

7.11 Further Reading

  • Pearl (2009): Causality, Chapters 3-4
  • Shpitser and Pearl (2006): “Identification of joint interventional distributions”
  • Bareinboim and Pearl (2012): “Causal transportability”
  • VanderWeele (2015): Explanation in Causal Inference — comprehensive treatment of mediation and path-specific effects
  • Imai et al. (2010): “Identification, inference, and sensitivity analysis for causal mediation effects”
  • Avin et al. (2005): “Identifiability of path-specific effects”
  • Rothman et al. (2021): Modern Epidemiology (4th ed.) — comprehensive coverage of confounding, bias, threats to validity, and causal diagrams
  • Do-Calculus: Rules for Interventions: Symbolic approach to identification