Discovering Causal Relationships

Observational Data

In observational studies, researchers observe the exposure and outcome variables without any intervention or manipulation. They simply observe the data as it naturally occurs. Therefore, the distribution of potential outcomes in an observational study reflects the natural variation in the exposure variable along with any confounding variables.

In contrast, in interventional studies, researchers manipulate the exposure variable to observe the effect on the outcome variable. This manipulation creates a new distribution of potential outcomes that is specific to the intervention.

To understand the difference between the two, consider an example of a study looking at the effect of a new medication on blood pressure. In an observational study, researchers would simply observe the blood pressure levels of individuals who have been prescribed the medication and those who have not. In this case, the distribution of potential outcomes would be based on the natural variation in blood pressure levels between the two groups.

In an interventional study, researchers would randomly assign individuals to either the medication or placebo group and observe the effect on their blood pressure levels. In this case, the distribution of potential outcomes would be based on the effect of the medication on blood pressure levels.

Because observational data is subject to confounding bias, causal discovery is essential for correctly estimating causal effects. In the blood pressure example, individuals who chose to take blood pressure medication may already be making other lifestyle changes that lower their blood pressure, confounding the effect. Controlling for these factors is necessary to obtain an unbiased estimate of the medication's effect.
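The blood pressure example can be made concrete with a small simulation. This is a hypothetical sketch (all numbers are invented for illustration): a "lifestyle" confounder makes people both more likely to take the medication and likely to have lower blood pressure, so the naive group comparison overstates the medication's benefit, while adjusting for the confounder recovers the true effect.

```python
import numpy as np

# Hypothetical simulation of the blood-pressure example: a lifestyle
# confounder raises the chance of taking medication AND independently
# lowers blood pressure. The true causal effect is -5 mmHg.
rng = np.random.default_rng(0)
n = 100_000
lifestyle = rng.normal(size=n)                # confounder
p_med = 1 / (1 + np.exp(-2 * lifestyle))      # healthier lifestyle -> more likely to medicate
med = (rng.random(n) < p_med).astype(float)   # treatment indicator
bp = 140 - 5 * med - 4 * lifestyle + rng.normal(scale=3, size=n)

# Naive estimate: difference in mean blood pressure between the groups.
naive = bp[med == 1].mean() - bp[med == 0].mean()

# Adjusted estimate: regress bp on the medication AND the confounder.
X = np.column_stack([np.ones(n), med, lifestyle])
beta, *_ = np.linalg.lstsq(X, bp, rcond=None)
adjusted = beta[1]

print(f"naive estimate:    {naive:+.2f} mmHg")     # overstates the benefit
print(f"adjusted estimate: {adjusted:+.2f} mmHg")  # close to the true -5
```

The naive estimate is pulled well below -5 mmHg because the medicated group was already healthier; including the confounder in the regression removes that bias.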

Causal Discovery

Causal discovery is the process of inferring causal relationships between variables from observational data. It is a fundamental task in data science, as it can help identify the drivers of certain phenomena and aid in decision-making. Two classes of methods for causal discovery from observational data are constraint-based and score-based methods.

Constraint-based methods aim to discover causal relationships by identifying causal structures that satisfy certain constraints imposed by the data. The most prominent example of constraint-based methods is the PC algorithm, which is based on conditional independence tests and graph theory. The PC algorithm first identifies the conditional independence relationships between variables and then applies graph-theoretic orientation rules to recover the set of causal structures consistent with those independencies.
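The core of the PC algorithm, its skeleton phase, can be sketched in a few dozen lines. This is an illustrative simplification, not a production implementation: it starts from a fully connected graph and deletes an edge whenever some conditioning set of neighbours renders the two variables conditionally independent, using a Fisher-z partial-correlation test (which assumes linear-Gaussian data).

```python
import numpy as np
from itertools import combinations
from scipy import stats

def ci_test(data, i, j, cond, alpha=0.01):
    """Fisher-z test of X_i independent of X_j given X_cond (partial correlation)."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha  # True => conditionally independent

def pc_skeleton(data, alpha=0.01):
    """Skeleton phase of PC: begin fully connected, drop edge i-j whenever
    some subset of i's other neighbours makes i and j independent."""
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}
    size = 0
    while any(len(adj[i]) - 1 >= size for i in range(d)):
        for i in range(d):
            for j in list(adj[i]):
                if j not in adj[i]:
                    continue  # already removed from the other side
                others = adj[i] - {j}
                if len(others) < size:
                    continue
                for cond in combinations(others, size):
                    if ci_test(data, i, j, cond, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        size += 1
    return {frozenset((i, j)) for i in adj for j in adj[i]}

# Toy linear chain X0 -> X1 -> X2: X0 and X2 are dependent marginally
# but independent given X1, so the recovered skeleton is X0 - X1 - X2.
rng = np.random.default_rng(1)
n = 5000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
edges = pc_skeleton(np.column_stack([x0, x1, x2]))
print(sorted(tuple(sorted(e)) for e in edges))
```

The full PC algorithm then orients the remaining edges using v-structures and propagation rules, which are omitted here for brevity.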

Score-based methods aim to find causal structures that maximize a certain score function based on the data. One of the popular score-based methods is the Greedy Equivalence Search (GES) algorithm, which uses a score function to iteratively add and remove edges from a graph until it reaches a locally optimal structure. GES is known for its efficiency and ability to handle large datasets. Another score-based approach uses A* search, framing structure learning as a search through the space of possible graphs. Guided by an admissible heuristic, A* can explore this space efficiently and identify the highest-scoring causal structure that satisfies the given constraints.
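The scoring idea behind these methods can be demonstrated with a minimal sketch. For simplicity this enumerates every DAG on three variables exhaustively rather than using GES's greedy equivalence-class moves or A*'s heuristic search, but the Gaussian BIC score it maximizes is the same kind of score those algorithms typically use.

```python
import numpy as np
from itertools import combinations, product

def bic_score(data, parents):
    """Gaussian BIC of a DAG: each node is linearly regressed on its parents."""
    n, _ = data.shape
    total = 0.0
    for node, pa in parents.items():
        y = data[:, node]
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in pa])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = ((y - X @ beta) ** 2).sum()
        total += -0.5 * n * np.log(rss / n) - 0.5 * X.shape[1] * np.log(n)
    return total

def is_acyclic(parents):
    """Kahn's algorithm: a graph is a DAG iff all nodes can be topologically sorted."""
    d = len(parents)
    indeg = {i: len(parents[i]) for i in range(d)}
    children = {i: [j for j in range(d) if i in parents[j]] for i in range(d)}
    frontier = [i for i in range(d) if indeg[i] == 0]
    seen = 0
    while frontier:
        u = frontier.pop()
        seen += 1
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    return seen == d

def all_dags(d):
    """Every DAG on d nodes (feasible only for small d)."""
    pairs = list(combinations(range(d), 2))
    for choice in product((0, 1, 2), repeat=len(pairs)):  # none / i->j / j->i
        parents = {i: [] for i in range(d)}
        for (i, j), c in zip(pairs, choice):
            if c == 1:
                parents[j].append(i)
            elif c == 2:
                parents[i].append(j)
        if is_acyclic(parents):
            yield parents

# Same toy chain X0 -> X1 -> X2: the best-scoring DAG has the skeleton
# X0 - X1 - X2 (BIC cannot distinguish DAGs within the same Markov
# equivalence class, so the edge orientations may vary).
rng = np.random.default_rng(1)
n = 5000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

best = max(all_dags(3), key=lambda p: bic_score(data, p))
skeleton = {frozenset((c, p)) for c, pa in best.items() for p in pa}
print(skeleton)
```

Exhaustive enumeration is super-exponential in the number of nodes, which is exactly why practical methods like GES search greedily over equivalence classes and A* prunes the search space with an admissible heuristic.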

In summary, causal discovery from observational data is a critical task in data science. Constraint-based methods, such as the PC algorithm, identify causal structures that satisfy the conditional independence constraints found in the data, while score-based methods, such as GES and A*, search for structures that maximize a score function.

Causal Sufficiency

Causal sufficiency refers to the condition where all the common causes (or confounders) of the treatment and outcome variables are observed and properly measured. Ensuring causal sufficiency is crucial for obtaining unbiased estimates of causal effects. If causal sufficiency is not met, it implies that there are unobserved confounders that can bias the estimated relationship between the treatment and the outcome.

Causal insufficiency occurs when not all common causes of the treatment and outcome are accounted for in the analysis. This can lead to biased estimates of causal effects due to confounding, as the relationship between the treatment and the outcome may be influenced by these unobserved factors. Causal insufficiency is a common problem in observational studies where random assignment of treatment is not possible.
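A brief hypothetical simulation shows why causal insufficiency is so damaging. Here the treatment has no effect at all on the outcome, but a latent variable drives both; because that variable is never recorded, no adjustment on observed variables can remove the bias.

```python
import numpy as np

# Illustration of causal insufficiency: the treatment t has NO causal
# effect on the outcome y, but a latent confounder u (never recorded
# in the dataset) drives both, producing a spurious association.
rng = np.random.default_rng(2)
n = 50_000
u = rng.normal(size=n)           # unobserved confounder
t = u + rng.normal(size=n)       # treatment influenced by u
y = 2 * u + rng.normal(size=n)   # outcome influenced only by u, not by t

# Regressing y on t, the only observed candidate cause, suggests a
# strong effect even though the true effect is zero.
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
biased = beta[1]
print(f"estimated effect of t on y: {biased:.2f} (true effect: 0)")
```

With `u` in the dataset, the bias would vanish after adjustment, as in the earlier medication example; without it, the analyst has no observed variable to condition on.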

Various causal discovery techniques (such as FCI) can flag relationships between variables that cannot be resolved from the data alone, signalling the presence of latent confounders that are not captured in the dataset. Additionally, with Human Guided Causal Discovery, practitioners can intervene in the construction of the causal graph to specify when a relationship can only be explained by external confounding factors.
