 "\n",
 "The multiply-robust causal change attribution method is based on a combination of *regression* and *re-weighting* approaches. In the regression approach, we learn the dependence between a node and its parents in one sample, and then use the data from the other sample to shift the distribution of that node. In the re-weighting approach, we average the data, giving more weight to those observations that closely resemble the target distribution.\n",
 "\n",
-"By default, ```dowhy.gcm.distribution_change_robust``` uses linear and logistic regression to learn the regression function and the weights. Here, since our dataset is quite large, we will use the more flexible algorithms ```LGBMRegressor``` and ```LGBMClassifier``` instead.\n",
+"By default, ```dowhy.gcm.distribution_change_robust``` uses linear and logistic regression to learn the regression function and the weights. Here, since our dataset is quite large, we will use the more flexible algorithms ```HistGradientBoostingRegressor``` and ```HistGradientBoostingClassifier``` instead.\n",
 "\n",
 "We also use ```IsotonicRegression``` to calibrate the probabilities that make up the weights for the re-weighting approach on a leave-out calibration sample. This is optional, but it has been shown to improve the performance of the method in simulations.\n",
 "\n",
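The re-weighting idea described above can be sketched in isolation: train a probabilistic classifier to distinguish the two samples, turn its predicted odds into density-ratio weights, and use those weights to shift a statistic of one sample toward the other distribution. This is a minimal illustration under assumed synthetic data, not the notebook's code; the variable names and the choice of `LogisticRegression` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two samples of a single node X whose distribution shifts between periods.
x_old = rng.normal(0.0, 1.0, size=(5000, 1))  # "old" sample, mean 0
x_new = rng.normal(0.5, 1.0, size=(5000, 1))  # "new" (target) sample, mean 0.5

# A probabilistic classifier distinguishing old (label 0) from new (label 1).
X = np.vstack([x_old, x_new])
y = np.concatenate([np.zeros(len(x_old)), np.ones(len(x_new))])
clf = LogisticRegression().fit(X, y)

# Density-ratio weights p_new(x) / p_old(x) from the classifier's odds.
p = clf.predict_proba(x_old)[:, 1]
w = p / (1.0 - p)

# The weighted mean of the old sample approximates the mean under the
# new distribution (np.average self-normalizes the weights).
reweighted_mean = np.average(x_old.ravel(), weights=w)
print(reweighted_mean)  # close to 0.5 rather than 0.0
```

Miscalibrated probabilities `p` distort the weights `w`, which is why calibrating them (e.g. with isotonic regression on a leave-out sample) can improve the estimator.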
 },
 "outputs": [],
 "source": [
-"from lightgbm import LGBMClassifier, LGBMRegressor\n",
+"from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier\n",
 "from sklearn.isotonic import IsotonicRegression\n",
 "from dowhy.gcm.ml.classification import SklearnClassificationModelWeighted\n",
 "from dowhy.gcm.ml.regression import SklearnRegressionModelWeighted\n",
 "from dowhy.gcm.util.general import auto_apply_encoders, auto_fit_encoders, shape_into_2d\n",
 "\n",
 "def make_custom_regressor():\n",
-"    return SklearnRegressionModelWeighted(LGBMRegressor(random_state = 0, n_jobs = -1, verbose = -100))\n",
+"    return SklearnRegressionModelWeighted(HistGradientBoostingRegressor(random_state = 0))\n",
 "\n",
 "def make_custom_classifier():\n",
-"    return SklearnClassificationModelWeighted(LGBMClassifier(random_state = 0, n_jobs = -1, verbose = -100, is_unbalance = True))\n",
+"    return SklearnClassificationModelWeighted(HistGradientBoostingClassifier(random_state = 0))\n",
 "\n",
 "def make_custom_calibrator():\n",
 "    return SklearnRegressionModelWeighted(IsotonicRegression(out_of_bounds = 'clip'))\n",
 "\n",
-"dist_change_fun_kwargs = {\n",
-"    'regressor' : LGBMRegressor,\n",
-"    'regressor_kwargs' : {'random_state' : 0, 'n_jobs' : -1, 'verbose' : -100},\n",
-"    'classifier' : LGBMClassifier,\n",
-"    'classifier_kwargs' : {'random_state' : 0, 'n_jobs' : -1, 'is_unbalance' : True, 'verbose' : -100},\n",
-"    'calibrator' : IsotonicRegression,\n",
-"    'calibrator_kwargs' : {'out_of_bounds' : 'clip'},\n",
-"    'calib_size' : 0.2,\n",
-"    'sample_weight' : 'weight',\n",
-"    'xfit' : False\n",
-"}\n",
-"\n",
 "gcm.distribution_change_robust(causal_model, data_old, data_new, 'wage', sample_weight = 'weight',\n",
-"                               xfit = False,\n",
+"                               xfit = False, calib_size = 0.2,\n",
 "                               regressor = make_custom_regressor,\n",
 "                               classifier = make_custom_classifier,\n",
-"                               calibrator = make_custom_calibrator, calib_size = 0.2)"
+"                               calibrator = make_custom_calibrator)"
]
},
{
 "\n",
 "First, notice that the Shapley values for $P(\\mathtt{educ})$, $P(\\mathtt{occup} \\mid \\mathtt{educ})$ and $P(\\mathtt{wage} \\mid \\mathtt{occup}, \\mathtt{educ})$ add up to the total effect.\n",
 "\n",
-"Second, the Shapley value for $P(\\mathtt{educ})$ is positive and statistically significant. One way to interpret this measure is that, if men and women differed only in their $P(\\mathtt{educ})$ (but their other causal mechanisms were the same), women would earn \\\\$1.12/hour more than men on average. Conversely, the Shapley value for $P(\\mathtt{occup} \\mid \\mathtt{educ})$ is negative, statistically significant and of slightly larger magnitude than the first Shapley value, hence cancelling out the effect of differences in education. These effects measure two things:\n",
+"Second, the Shapley value for $P(\\mathtt{educ})$ is positive and statistically significant. One way to interpret this measure is that, if men and women differed only in their $P(\\mathtt{educ})$ (but their other causal mechanisms were the same), women would earn \\\\$1.13/hour more than men on average. Conversely, the Shapley value for $P(\\mathtt{occup} \\mid \\mathtt{educ})$ is negative, statistically significant and of larger magnitude than the first Shapley value, hence cancelling out the effect of differences in education. These effects measure two things:\n",
 "1. How different is a causal mechanism between males and females?\n",
 "2. How important is a causal mechanism for the outcome?\n",
 "\n",
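The additivity claim above (Shapley values summing to the total effect) is the efficiency property of Shapley values, and it can be checked on a toy coalition game. The coalition values below are made-up illustrative numbers, not the notebook's estimates; `v[S]` stands in for the change in mean wage when the mechanisms in `S` are switched from one population's distribution to the other's.

```python
from itertools import combinations
from math import factorial

# Illustrative coalition values (hypothetical numbers). Keys are sorted tuples
# of the mechanisms switched to the other population's distribution.
v = {
    (): 0.0,
    ('educ',): 1.13,
    ('occup',): -1.40,
    ('wage',): 0.30,
    ('educ', 'occup'): -0.20,
    ('educ', 'wage'): 1.50,
    ('occup', 'wage'): -1.10,
    ('educ', 'occup', 'wage'): 0.10,  # total effect: all mechanisms switched
}

players = ('educ', 'occup', 'wage')

def shapley(player):
    """Shapley value: weighted average marginal contribution over coalitions."""
    n = len(players)
    others = [p for p in players if p != player]
    val = 0.0
    for r in range(n):
        for s in combinations(others, r):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            s_without = tuple(sorted(s))
            s_with = tuple(sorted(s + (player,)))
            val += weight * (v[s_with] - v[s_without])
    return val

phi = {p: shapley(p) for p in players}
total = sum(phi.values())
# Efficiency: the Shapley values add up exactly to v(all) - v(empty) = 0.10.
print(total)
```

The same property explains why a large positive value for one mechanism can be offset by a negative value for another, as with education and occupation here.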