GitHub - wilfred11/csl: Project on causal discovery

Causal discovery

A causal graph contains nodes and edges which link nodes that are causally related.

Assumptions

The Causal Markov Condition implies that once you know the direct causes of a variable, any other information about its non-descendants becomes irrelevant in predicting its value.

Causal Faithfulness in causal inference states that the conditional independence relationships observed in a probability distribution accurately reflect the causal structure.

Causal Sufficience refers to the assumption that all relevant confounders (unobserved variables that influence both the cause and effect) have been measured and included in the analysis.

Acyclicity means there can be no cycle in the causal graph.

Knowledge

A Markov equivalence class, in the context of graphical models, is a set of Directed Acyclic Graphs (DAGs) that all represent the same set of conditional independence relationships between variables. These DAGs, despite potentially having different edge orientations, share the same skeleton (undirected graph with the same edges) and the same "v-structures" (patterns where two converging arrows point to the same node). A Completed Partially Directed Acyclic Graph (CPDAG) can uniquely represent a Markov equivalence class.

Conditional independence tests

Conditional independence (CI) tests are crucial for causal discovery, as they help identify relationships between variables by determining if they are statistically independent given a set of other variables. These tests form the foundation of constraint-based causal discovery algorithms like the PC algorithm, which use CI relationships to infer causal structures.

Null hypothesis

The null hypothesis assumes that there is no association between the two variables of interest. A p-value can then be calculated and if it is below 0.05 the null hypothesis will be rejected suggesting that there is significant association between the variables.

Experiment

In this test I will try to find the dependencies between variables altitude, latitude and temperature. I found some dataset containing information on 500+ cities in Brazil. For every city it contains latitude, altitude and average temperature, I will try to find a causal graph that captures dependencies.

Predicted outcome

Google's AI engine states Latitude and altitude both have a significant impact on temperature.

Generally, as latitude increases (moving away from the equator towards the poles), temperatures tend to decrease. Similarly, temperature decreases with increasing altitude, meaning higher altitudes are typically colder. The causal graph for this mechanism looks like in the picture directly below.

Own experiments

Before anything else a pairplot should give some idea. For every 2 columns a pairplot shows how these two columns relates. The diagonal plots show the histogram for every column.

To get an idea of the associations between variables I have performed a (conditional) independency test for every combination possible. For this test I have used the "dowhy" package.

marginal independency test:

import dowhy.gcm as gcm gcm.independence_test(data[x], data[y], method="kernel")

conditional independency test:

gcm.independence_test(data[x], data[y],conditioned_on=data[z], method="kernel")

x	y	conditioned_on	p	assoc
altitude	latitude		0.31	no
altitude	temperature		0.08	no
latitude	temperature		0.05	yes
altitude	latitude	temperature	0.79	no
altitude	temperature	latitude	0.0	yes
latitude	temperature	altitude	0.12	no

Given this data I am able to produce a graph that looks like the graph directly below. I must admit using another statistical method (currently "kernel"), I get different results. The p-values are very close to the critical values so it is a matter of tens of rows to get another result. Brazil does not count that many cities located on mountains or in mountainous area.

Using the PC algorithm as implemented by causallearn

Using the PC algorithm as mentioned directly below, I get a result like in my own experiments. I really think I need data that is more global.

from causallearn.search.ConstraintBased.PC import pc

cg = pc(temp_data[relevant_cols].to_numpy())

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
data		data
.gitignore		.gitignore
README.md		README.md
discover.py		discover.py
main.py		main.py
pyproject.toml		pyproject.toml
temp_alt.py		temp_alt.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Causal discovery

Assumptions

Knowledge

Conditional independence tests

Null hypothesis

Experiment

Predicted outcome

Own experiments

Using the PC algorithm as implemented by causallearn

About

Uh oh!

Releases

Packages

Languages

wilfred11/csl

Folders and files

Latest commit

History

Repository files navigation

Causal discovery

Assumptions

Knowledge

Conditional independence tests

Null hypothesis

Experiment

Predicted outcome

Own experiments

Using the PC algorithm as implemented by causallearn

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages