Understand the fundamental problem of learning causal structure from observational data
Apply constraint-based methods (PC, FCI) for structure learning
Compare score-based approaches including GES and Bayesian structure learning
Use time series causal discovery methods (Granger, PCMCI, CCM) appropriately
Recognise the assumptions and limitations of causal discovery
Connect discovered structures to the CDM framework
8.2 From Identification to Discovery
Previous chapters assumed the causal graph \(G\) was known. Identification theory (see Identification: When Can We Learn from Data?) tells us when we can answer causal questions given a known structure. But in practice, we often do not know the graph—we must learn it from data. This chapter addresses causal discovery: the problem of inferring causal structure from observational (and sometimes interventional) data (Spirtes et al. 2000; Pearl 2009).
The fundamental challenge is that data alone cannot distinguish all causal structures. Consider three variables \(X\), \(Y\), and \(Z\). The graphs \(X \rightarrow Y \rightarrow Z\), \(X \leftarrow Y \leftarrow Z\), and \(X \leftarrow Y \rightarrow Z\) all imply the same conditional independencies under the Markov assumption—precisely \(X \perp\!\!\!\perp Z \mid Y\) and nothing more. Such graphs form a Markov equivalence class: they are indistinguishable from observational data. Discovery algorithms typically recover this equivalence class rather than a unique DAG.
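This indistinguishability can be checked numerically. The sketch below (a Python illustration; variable names and coefficients are assumptions for the demo) simulates a linear-Gaussian chain \(X \rightarrow Y \rightarrow Z\) and a fork \(X \leftarrow Y \rightarrow Z\), and shows that both exhibit marginal dependence between \(X\) and \(Z\) together with \(X \perp\!\!\!\perp Z \mid Y\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c from each."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

# Chain: X -> Y -> Z
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)

# Fork: X <- Y -> Z
y2 = rng.normal(size=n)
x2 = 0.8 * y2 + rng.normal(size=n)
z2 = 0.8 * y2 + rng.normal(size=n)

# In both structures X and Z are marginally dependent...
marginal_chain = np.corrcoef(x, z)[0, 1]
marginal_fork = np.corrcoef(x2, z2)[0, 1]
# ...but independent given Y, so CI patterns alone cannot tell them apart.
pc_chain = partial_corr(x, z, y)
pc_fork = partial_corr(x2, z2, y2)
```

Both partial correlations are statistically indistinguishable from zero, which is exactly the pattern a constraint-based algorithm would see for either graph.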
From the perspective of the three worlds (see The Causal Hierarchy and Three Worlds), discovery aims to recover the Structural world—the DAG encoding prehensive relations—from patterns in the Observable world. The discovered structure then enables interventional and counterfactual reasoning in the CDM framework.
8.3 Constraint-Based Methods
Constraint-based methods infer causal structure by testing conditional independencies in the data. The key insight: if \(X \perp\!\!\!\perp Y \mid Z\) holds in the population, then no direct edge exists between \(X\) and \(Y\) in the true DAG (under faithfulness). By systematically testing such independencies, we can prune edges and orient others.
8.3.1 The PC Algorithm
The PC algorithm (Spirtes et al. 2000) is the canonical constraint-based method. It proceeds in two phases:
Skeleton identification: Start with a complete undirected graph. For each pair \((X, Y)\), test \(X \perp\!\!\!\perp Y \mid S\) for conditioning sets \(S\) of increasing size. If independence holds for some \(S\), remove the edge \(X\)—\(Y\).
Orientation: Apply orientation rules (e.g., v-structures) to direct edges using the pattern of conditional independencies.
The algorithm relies on the faithfulness assumption: the conditional independencies in the data exactly correspond to d-separation in the true DAG. No “accidental” independencies exist beyond those implied by the graph structure.
Pseudocode (conceptual):
PC(D, α):
 1. Form complete undirected graph C on vertices V
 2. n ← 0
 3. repeat:
 4.     for each pair (X,Y) with edge in C:
 5.         for each S ⊆ Adj(X)\{Y} with |S| = n:
 6.             if X ⊥ Y | S (test at level α):
 7.                 Remove edge X—Y, record S as SepSet(X,Y)
 8.                 break
 9.     n ← n + 1
10. until all adjacency sets are smaller than n
11. Orient v-structures: X—Z—Y with X, Y non-adjacent and Z ∉ SepSet(X,Y) ⇒ X→Z←Y
12. Apply further orientation rules (Meek rules)
13. return (partially directed) graph
Complexity: The PC algorithm has complexity \(O(p^d)\) where \(p\) is the number of variables and \(d\) is the maximum degree of the true graph. Sparse graphs admit efficient discovery.
The following chunk illustrates the conceptual structure of constraint-based discovery—testing conditional independencies to prune the skeleton:
# Conceptual structure of the PC algorithm: conditional independence testing
# Requires: HypothesisTests, Graphs (not guaranteed in minimal environment)
# Pseudocode representation: for each pair (X,Y), test X ⊥ Y | S;
# if independence holds, remove the edge and record the separating set.
function pc_skeleton_conceptual(V, independence_test)
    # Start with the complete undirected graph.
    # For n = 0, 1, 2, ...: for each adjacent (X,Y), test X ⊥ Y | S with |S| = n.
    # Remove the edge if independence is found.
    # Complexity: O(p^d) where d = max degree.
    return nothing  # Placeholder for conceptual illustration
end

# Key parameter: significance level α for the conditional independence tests.
# Common choice: α = 0.05 or 0.01
α = 0.05
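For a runnable counterpart to the conceptual chunk, here is a minimal sketch (in Python, with Fisher-z partial-correlation CI tests; function names and the linear-Gaussian setting are assumptions for illustration). It implements only the skeleton phase; orientation rules are omitted for brevity:

```python
import numpy as np
from itertools import combinations
from math import erf, sqrt, atanh

def fisher_z_indep(data, i, j, S, alpha):
    """Fisher-z test of X_i ⊥ X_j | X_S via partial correlation (Gaussian CI test)."""
    idx = [i, j] + list(S)
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    r = -prec[0, 1] / sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)
    z = sqrt(data.shape[0] - len(S) - 3) * abs(atanh(r))
    p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return p_value > alpha          # True = "independent" (fail to reject)

def pc_skeleton(data, alpha=0.01):
    """Skeleton phase of PC: prune a complete graph by CI tests of growing order."""
    p = data.shape[1]
    adj = {i: set(range(p)) - {i} for i in range(p)}
    sepset = {}
    depth = 0
    while any(len(adj[i]) - 1 >= depth for i in range(p)):
        for i in range(p):
            for j in sorted(adj[i]):
                for S in combinations(sorted(adj[i] - {j}), depth):
                    if fisher_z_indep(data, i, j, S, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(S)
                        break
        depth += 1
    return adj, sepset

# Chain X0 -> X1 -> X2: the recovered skeleton should be X0 — X1 — X2,
# with {X1} as the separating set for (X0, X2).
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
adj, sepset = pc_skeleton(np.column_stack([x0, x1, x2]), alpha=0.001)
```

The recorded separating sets are what the orientation phase would consult to detect v-structures.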
8.3.2 The FCI Algorithm
When latent confounders may exist, the PC algorithm can produce incorrect conclusions. The FCI (Fast Causal Inference) algorithm (Spirtes et al. 2000; Zhang 2008) extends PC to handle latent variables and selection bias.
FCI outputs a Partial Ancestral Graph (PAG) rather than a DAG. A PAG represents the Markov equivalence class of the underlying DAG with latent variables. Edge marks include:
\(-\) (solid): definite adjacency
\(o\) (circle): possibly arrowhead or tail (uncertain orientation)
\(>\) (arrowhead): definite arrowhead
When to use FCI vs PC:
PC: Use when you assume causal sufficiency—no unobserved common causes. Appropriate for carefully designed experiments or when confounders are measured.
FCI: Use when latent confounders are plausible. FCI correctly represents uncertainty about orientation and can indicate possible confounding.
Interpreting a PAG requires care: a circle (\(o\)) indicates that the algorithm could not determine whether the edge is directed or bidirected from the data. Such ambiguity reflects genuine epistemic limits—observational data cannot resolve certain structural questions without further assumptions or interventions.
8.4 Score-Based Methods
Score-based methods treat structure learning as optimisation: search over DAGs to maximise a score that balances fit to data against model complexity.
\[
\text{Score}(G; D) = \log \mathcal{L}(\hat{\theta}; D, G) - \text{penalty}(G, n),
\]
where \(\hat{\theta}\) are the maximum-likelihood parameters for \(G\) given data \(D\), and the penalty discourages overly complex graphs.
8.4.1 BIC and AIC
Common scores include:
BIC (Bayesian Information Criterion): \(\text{BIC} = -2 \log \mathcal{L} + k \log n\), where \(k\) is the number of parameters and \(n\) is sample size. BIC is consistent for structure selection under regularity conditions.
AIC: \(\text{AIC} = -2 \log \mathcal{L} + 2k\). AIC tends to select larger models; BIC penalises complexity more strongly.
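To make the scores concrete, here is a sketch (Python, linear-Gaussian models; the node-wise decomposition and the sign convention—higher is better, i.e. \(-\text{BIC}/2\)—are illustrative choices, not prescribed by the text) that scores a candidate DAG:

```python
import numpy as np

def gaussian_bic(data, parents_of):
    """BIC score of a linear-Gaussian DAG, decomposed node by node.

    parents_of: dict mapping node index -> list of parent indices.
    Convention: score = log-likelihood - (k/2) log n, so higher is better
    (this is -BIC/2 in the -2 log L + k log n form).
    """
    n, _ = data.shape
    score = 0.0
    for node, parents in parents_of.items():
        y = data[:, node]
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        k = X.shape[1] + 1            # regression coefficients + noise variance
        score += loglik - 0.5 * k * np.log(n)
    return score

# Simulated chain X0 -> X1 -> X2
rng = np.random.default_rng(3)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

s_chain = gaussian_bic(data, {0: [], 1: [0], 2: [1]})      # true DAG
s_reversed = gaussian_bic(data, {0: [1], 1: [2], 2: []})   # Markov-equivalent
s_extra = gaussian_bic(data, {0: [], 1: [0], 2: [0, 1]})   # superfluous edge
s_empty = gaussian_bic(data, {0: [], 1: [], 2: []})        # no edges
```

Note that the two Markov-equivalent DAGs receive (numerically) identical scores—the score equivalence that GES exploits—while the superfluous edge is penalised and the empty graph fits poorly.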
8.4.2 Greedy Equivalence Search (GES)
GES (Chickering 2002) operates on the space of equivalence classes (represented as completed partially directed acyclic graphs, CPDAGs) rather than individual DAGs. It performs a two-phase greedy search:
Forward phase: Start with empty graph. Repeatedly add the edge that most improves the score.
Backward phase: Repeatedly remove the edge that most improves the score.
GES is score-equivalent: it assigns the same score to Markov-equivalent DAGs. Under faithfulness and with a consistent score (e.g., BIC), GES is consistent for the true equivalence class.
8.4.3 Bayesian Approach
A fully Bayesian approach places a posterior over DAGs:
\[
P(G \mid D) \propto P(D \mid G) \, P(G)
\]
where \(P(D \mid G)\) is the marginal likelihood (integrated over parameters) and \(P(G)\) is a prior over structures. Common priors favour sparser graphs (e.g., uniform over edge count, or a Beta-Binomial prior). MCMC or exact methods can approximate the posterior. This yields not just a point estimate but uncertainty over structures—useful when multiple graphs fit the data similarly.
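For small graphs the posterior can be computed exhaustively. The sketch below (Python; it uses the BIC as a rough proxy for the log marginal likelihood and a uniform structure prior—common but simplifying assumptions) enumerates the 25 DAGs on three labelled nodes and normalises their weights:

```python
import numpy as np
from itertools import product

def is_acyclic(adj):
    """Kahn's algorithm; adj[i][j] == 1 means an edge i -> j."""
    p = len(adj)
    indeg = [sum(adj[i][j] for i in range(p)) for j in range(p)]
    queue = [j for j in range(p) if indeg[j] == 0]
    seen = 0
    while queue:
        i = queue.pop()
        seen += 1
        for j in range(p):
            if adj[i][j]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return seen == p

def log_score(data, adj):
    """BIC proxy for log P(D | G) under a linear-Gaussian model."""
    n, p = data.shape
    total = 0.0
    for j in range(p):
        parents = [i for i in range(p) if adj[i][j]]
        X = np.column_stack([np.ones(n)] + [data[:, i] for i in parents])
        beta, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        resid = data[:, j] - X @ beta
        sigma2 = resid @ resid / n
        total += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        total += -0.5 * (X.shape[1] + 1) * np.log(n)
    return total

def posterior_over_dags(data):
    """Exhaustive posterior P(G | D) ∝ exp(BIC proxy) with a uniform prior."""
    p = data.shape[1]
    edge_slots = [(i, j) for i in range(p) for j in range(p) if i != j]
    graphs, logs = [], []
    for mask in product([0, 1], repeat=len(edge_slots)):
        adj = [[0] * p for _ in range(p)]
        for (i, j), m in zip(edge_slots, mask):
            adj[i][j] = m
        if is_acyclic(adj):
            graphs.append(adj)
            logs.append(log_score(data, adj))
    logs = np.array(logs)
    w = np.exp(logs - logs.max())
    return graphs, w / w.sum()

# The posterior should concentrate on the Markov equivalence class of a chain
rng = np.random.default_rng(4)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
graphs, post = posterior_over_dags(np.column_stack([x0, x1, x2]))

chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # X0 -> X1 -> X2
rev = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]     # X0 <- X1 <- X2
fork = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]    # X0 <- X1 -> X2
class_mass = sum(w for g, w in zip(graphs, post) if g in (chain, rev, fork))
```

The posterior mass splits roughly evenly across the three Markov-equivalent DAGs—exactly the structural uncertainty a point estimate would hide.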
Comparison with constraint-based methods:
| Aspect | Constraint-based (PC) | Score-based (GES) |
|---|---|---|
| Assumptions | Faithfulness, causal sufficiency | Faithfulness, parametric model |
| Output | Equivalence class (CPDAG) | Equivalence class (CPDAG) |
| Robustness | Sensitive to CI test errors | Sensitive to model misspecification |
| Scalability | \(O(p^d)\) | Depends on search; often faster in practice |
8.5 Time Series Causal Discovery
Causal discovery in time series must account for temporal ordering and autocorrelation. The past can cause the present, but not vice versa—this constraint reduces the space of possible structures.
8.5.1 Granger Causality
Granger causality (Granger 1969) is a predictive notion: \(X\) Granger-causes \(Y\) if past values of \(X\) improve prediction of \(Y\) beyond past values of \(Y\) alone.
Mathematical formulation (bivariate case): Let \(Y_t\) be predicted by:
\[
Y_t = \alpha_0 + \sum_{j=1}^{p} \alpha_j Y_{t-j} + \sum_{k=1}^{q} \beta_k X_{t-k} + \varepsilon_t.
\]
\(X\) Granger-causes \(Y\) if any \(\beta_k \neq 0\) (reject the null hypothesis that all \(\beta_k = 0\)).
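A minimal sketch of this test (Python; the F statistic is computed from restricted and unrestricted residual sums of squares, with variable names and simulation coefficients assumed for illustration):

```python
import numpy as np

def granger_f(y, x, lags=2):
    """F statistic for 'x Granger-causes y': restricted model (lags of y only)
    versus full model (lags of y and x), fit by least squares."""
    T = len(y)
    Y = y[lags:]
    Z_r = np.column_stack([np.ones(T - lags)] +
                          [y[lags - k: T - k] for k in range(1, lags + 1)])
    Z_f = np.column_stack([Z_r] +
                          [x[lags - k: T - k] for k in range(1, lags + 1)])

    def rss(Z):
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        resid = Y - Z @ beta
        return resid @ resid

    rss_r, rss_f = rss(Z_r), rss(Z_f)
    df1, df2 = lags, (T - lags) - Z_f.shape[1]
    return ((rss_r - rss_f) / df1) / (rss_f / df2)

# Simulate x -> y with a one-step delay
rng = np.random.default_rng(1)
T = 1000
x = np.zeros(T); y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.6 * x[t - 1] + rng.normal()

f_x_to_y = granger_f(y, x)   # large: past x improves prediction of y
f_y_to_x = granger_f(x, y)   # near 1: past y adds nothing for x
```

The asymmetry of the two F statistics is what the test reports; a large value in one direction only is the classic Granger signature.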
Limitations:
Confounders: A common cause \(Z\) of \(X\) and \(Y\) can induce apparent Granger causality.
Nonlinearity: Granger causality is linear; nonlinear dependencies may be missed.
Indirect causation: Granger detects temporal precedence, not necessarily direct causation.
8.5.2 PCMCI
PCMCI (Peter and Clark Momentary Conditional Independence) (Runge et al. 2019) combines the PC algorithm with time-lagged dependencies. It:
Builds a superset of parents using a variant of PC with lagged variables.
Applies MCI (Momentary Conditional Independence) tests to remove spurious links, conditioning on the full past.
PCMCI properly handles autocorrelation: conditioning on past values of all variables avoids spurious associations from shared history. The key assumption is causal sufficiency in the time-lagged variable set—no unobserved common causes of the lagged variables.
8.5.3 Convergent Cross Mapping (CCM)
CCM (Sugihara et al. 2012) is designed for nonlinear dynamical systems. It exploits Takens’ embedding theorem: the attractor of a dynamical system can be reconstructed from a single observed time series. CCM tests whether \(Y\)’s attractor can be reconstructed from \(X\)’s time series—if so, \(X\) and \(Y\) are dynamically coupled.
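A compact sketch of the cross-mapping idea (Python; the simplex-style nearest-neighbour weighting and the coupled logistic-map example are standard illustrations, with all parameter choices assumed here). If \(X\) drives \(Y\), the driven series carries an imprint of the driver, so neighbours on \(Y\)’s shadow manifold pick out similar values of \(X\):

```python
import numpy as np

def shadow_manifold(series, E=2, tau=1):
    """Delay-coordinate embedding (Takens): row t is [s_t, s_{t-tau}, ...]."""
    start = (E - 1) * tau
    return np.column_stack([series[start - k * tau: len(series) - k * tau]
                            for k in range(E)])

def cross_map_skill(source, target, E=2, tau=1):
    """How well target's shadow manifold reconstructs source (CCM skill)."""
    M = shadow_manifold(target, E, tau)
    src = source[(E - 1) * tau:]
    preds = np.empty(len(M))
    for t in range(len(M)):
        d = np.linalg.norm(M - M[t], axis=1)
        d[t] = np.inf                       # exclude the point itself
        nn = np.argsort(d)[:E + 1]          # E + 1 nearest neighbours
        w = np.exp(-d[nn] / max(d[nn][0], 1e-12))
        preds[t] = np.sum(w * src[nn]) / w.sum()
    return np.corrcoef(preds, src)[0, 1]

# Coupled logistic maps with unidirectional forcing: x drives y
T = 1100
x = np.empty(T); y = np.empty(T)
x[0], y[0] = 0.4, 0.2
for t in range(T - 1):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t])
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])
x, y = x[100:], y[100:]                     # discard transient

skill_from_y = cross_map_skill(source=x, target=y)  # reconstruct driver x
skill_from_x = cross_map_skill(source=y, target=x)  # reconstruct driven y
```

The directional asymmetry—\(Y\)’s manifold recovers \(X\) better than the reverse—is CCM’s evidence that \(X\) forces \(Y\).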
When to use:
Granger: Linear systems, quick screening, well-established interpretation. Suitable when relationships are approximately linear and confounders are controlled.
PCMCI: Multivariate time series, proper handling of autocorrelation, constraint-based approach. Preferred when multiple variables are observed and lagged dependencies matter.
CCM: Nonlinear dynamics, potentially low-dimensional attractors, when linear methods may fail. Particularly useful in ecology and climate science where systems exhibit nonlinear coupling.
Note that Granger causality and structural causation are distinct concepts: \(X\) may Granger-cause \(Y\) without being a direct cause (e.g., through a mediating variable), and \(X\) may cause \(Y\) without Granger-causing it (e.g., in nonlinear systems where the relationship is not captured by linear prediction).
8.6 Discovery Under Interventions
Interventional data provides strictly more information than observational data (Eberhardt et al. 2007). An intervention on \(X\) breaks incoming edges to \(X\), altering the observational distribution. By comparing distributions under different interventions, we can distinguish graphs within the same Markov equivalence class.
Experimental design for structure learning:
Adaptive interventions: Choose which variable to intervene on next based on current uncertainty about the graph.
Minimum intervention sets: Find the smallest set of interventions needed to identify the true DAG.
Active learning of causal structure iteratively selects interventions to maximise information gain, reducing the equivalence class until the true structure is identified. In the best case, \(O(\log p)\) interventions suffice to identify a DAG on \(p\) variables; in the worst case, \(p-1\) interventions may be needed. The structure of the equivalence class determines the minimum number required.
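The following sketch (Python; a two-variable linear-Gaussian example with assumed coefficients) illustrates the basic mechanism—an intervention separates two structures that observational data cannot distinguish:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Observationally, X -> Y and X <- Y are Markov equivalent (both just say
# "X and Y are dependent"). An intervention do(X = x0) breaks the tie.

def simulate_x_causes_y(do_x=None):
    """Structure X -> Y with coefficient 1.5 (assumed for this demo)."""
    x = rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 1.5 * x + rng.normal(size=n)
    return x, y

def simulate_y_causes_x(do_x=None):
    """Structure Y -> X: intervening on X severs Y's only link to X."""
    y = rng.normal(size=n)
    x = 1.5 * y + rng.normal(size=n) if do_x is None else np.full(n, do_x)
    return x, y

_, y_obs = simulate_x_causes_y()
_, y_do = simulate_x_causes_y(do_x=2.0)
effect_forward = y_do.mean() - y_obs.mean()      # ≈ 1.5 * 2.0 = 3.0

_, y_obs_r = simulate_y_causes_x()
_, y_do_r = simulate_y_causes_x(do_x=2.0)
effect_reverse = y_do_r.mean() - y_obs_r.mean()  # ≈ 0: Y ignores do(X)
```

A shifted \(Y\) under \(do(X)\) is consistent only with \(X \rightarrow Y\); an unchanged \(Y\) points to \(X \leftarrow Y\) (or no edge), which is exactly the information adaptive designs exploit.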
8.7 Assumptions and Limitations
Causal discovery rests on several assumptions that may not hold in practice:
8.7.1 Core Assumptions
Faithfulness: All conditional independencies in the population arise from d-separation in the true DAG. Violations (e.g., near-cancellation of paths, or fine-tuned parameters that create “accidental” independencies) can mislead algorithms into removing true edges or failing to orient correctly.
Causal sufficiency: No unobserved common causes. Violations require FCI or similar methods that output PAGs.
Acyclicity: The true causal structure is a DAG. Feedback loops require extensions (e.g., summary graphs for stationary processes, or dynamic Bayesian networks with time-indexed variables).
8.7.2 What Can and Cannot Be Learned
From observational data alone:
We can typically recover the Markov equivalence class (CPDAG or PAG).
We cannot uniquely determine edge orientations that are equivalent under the Markov property.
We cannot distinguish causation from correlation without further assumptions (e.g., faithfulness, parametric restrictions).
8.7.3 Sample Complexity
Discovery algorithms require sufficient data for reliable conditional independence tests or score estimation. Sample complexity depends on:
Graph sparsity
Effect sizes (weaker effects require more data)
The number of variables \(p\)
8.7.4 Connection to Identifiability
Identification (Identification: When Can We Learn from Data?) asks: given a graph, can we identify a causal effect? Discovery asks: can we identify the graph itself? The two are complementary: discovery provides the structural input for identification; identification tells us what we can learn once we have (or assume) a structure. In practice, analysts often iterate: use discovery to propose a graph, check identification for the effects of interest, then refine the graph or collect interventional data if identification fails.
8.8 Connection to the CDM Framework
Causal discovery is the bridge from the Observable world to the Structural world. Discovered structures feed the CDM (Causal Dynamical Model) framework:
Discovery yields a candidate DAG (or equivalence class) from data.
Modelling instantiates the structure with parametric or nonparametric dynamical laws.
Validation checks consistency with new data and domain knowledge.
Refinement iterates: discovery → model → validate → discover.
The discovered structure enables interventional reasoning (what if we set \(X = x\)?) and counterfactual reasoning (what would have happened if \(X\) had been different?). Without a structural hypothesis, these questions cannot be addressed. Discovery provides that hypothesis—with appropriate epistemic humility about equivalence classes and assumptions.
In the process philosophy framing of this book, discovery aims to recover the prehensive structure—the pattern of how occasions prehend one another—from its observable traces. The structure is not merely a convenient summary; it encodes the causal architecture that constrains what can happen under intervention and counterfactual supposition.
Discovery is thus an inverse problem: we observe the outer world (associations, conditional independencies) and infer the inner structural world. Like all inverse problems, it is ill-posed without assumptions. Faithfulness, causal sufficiency, and acyclicity provide the regularisation that makes the problem tractable. When these assumptions hold approximately, discovery algorithms offer a principled route from data to structure—and from structure to the interventional and counterfactual reasoning that the CDM framework enables.
Chickering, David Maxwell. 2002. “Optimal Structure Identification with Greedy Search.” Journal of Machine Learning Research 3: 507–54.
Eberhardt, Frederick, Clark Glymour, and Richard Scheines. 2007. “On the Number of Experiments Required to Identify the Causal Structure of a System.” Journal of Machine Learning Research 8: 2651–98.
Granger, Clive W. J. 1969. “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods.” Econometrica 37 (3): 424–38.
Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
Runge, Jakob, Sebastian Bathiany, Erik Bollt, et al. 2019. “Detecting and Quantifying Causal Associations in Large Nonlinear Time Series Datasets.” Science Advances 5 (11): eaau4996. https://doi.org/10.1126/sciadv.aau4996.
Spirtes, Peter, Clark Glymour, and Richard Scheines. 2000. Causation, Prediction, and Search. 2nd ed. MIT Press.
Sugihara, George, Robert May, Hao Ye, et al. 2012. “Detecting Causality in Complex Ecosystems.” Science 338 (6106): 496–500. https://doi.org/10.1126/science.1227079.
Zhang, Jiji. 2008. “On the Completeness of Orientation Rules for Causal Discovery in the Presence of Latent Confounders and Selection Bias.” Artificial Intelligence 172 (16): 1873–96. https://doi.org/10.1016/j.artint.2008.08.001.