Add causal-predictive metrics dataset for others to analyze

bradyneal · Mar 27, 2021 · eec6005
1 parent cdeb575
commit eec6005

Showing 3 changed files with 1,589 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -27,6 +27,25 @@ If step 1 above fails, try the following instead: `conda env create -f environme
 └── utils.py		- various utility functions
 ```
 
+## Do your own analysis on our causal-predictive metric dataset
+
+We trained a total of 1568 different estimators.
+We recorded all of the predictive metrics that sklearn provides (e.g. RMSE, MAE, precision, and recall) and many of the causal metrics that RealCause provides (e.g. ATE bias, ATE RMSE, and PEHE).
+Together with the estimator specification (meta-estimator, outcome model, and propensity score model), these metrics make up a total of 77 columns.
+A cell is "nan" where a value does not apply (e.g. the propensity score model for a standardization estimator, a regression metric for an IPW estimator, or a classification metric for a standardization estimator).
+
+We provide this dataset in [causal-predictive-analysis.csv](https://github.com/bradyneal/causal-benchmark/blob/master/causal-predictive-analysis.csv).
+Section 6 of our paper reports one analysis of this dataset (in [experiments/uai_analysis.py](https://github.com/bradyneal/causal-benchmark/blob/master/experiments/uai_analysis.py)).
+However, many other analyses can be run on it.
+For example, one might want to fit machine learning models to predict causal metrics from predictive metrics and use something like [SHAP](https://github.com/slundberg/shap) to interpret the associations these models find.
+To get started, load the dataset from [causal-predictive-analysis.csv](https://github.com/bradyneal/causal-benchmark/blob/master/causal-predictive-analysis.csv).
+For example, with pandas (the relative path assumes you run this from the repository root):
+
+```python
+import pandas as pd
+
+df = pd.read_csv('causal-predictive-analysis.csv')
+```
 
 ## Loading RealCause pre-computed datasets