Commit c2ddd5a

Add explicit support for discrete ANMs
- Add new Discrete Additive Noise Model class that enforces the outputs to be discrete. This should help in generating more consistent data.
- As part of this, revised the auto assignment function and revised its docstring.
- Revised the auto assignment summary.

Signed-off-by: Patrick Bloebaum <bloebp@amazon.com>
1 parent 7c015b7 commit c2ddd5a

9 files changed: +434 -81 lines changed


docs/source/user_guide/modeling_gcm/model_evaluation.rst

Lines changed: 32 additions & 12 deletions
@@ -52,29 +52,49 @@ this, consider the chain structure example X→Y→Z:

 .. code-block::

-    Analyzed 3 nodes.
+    When using this auto assignment function, the given data is used to automatically assign a causal mechanism to each node. Note that causal mechanisms can also be customized and assigned manually.
+    The following types of causal mechanisms are considered for the automatic selection:
+
+    If root node:
+    An empirical distribution, i.e., the distribution is represented by randomly sampling from the provided data. This provides a flexible and non-parametric way to model the marginal distribution and is valid for all types of data modalities.
+
+    If non-root node and the data is continuous:
+    Additive Noise Models (ANM) of the form X_i = f(PA_i) + N_i, where PA_i are the parents of X_i and the unobserved noise N_i is assumed to be independent of PA_i.To select the best model for f, different regression models are evaluated and the model with the smallest mean squared error is selected.Note that minimizing the mean squared error here is equivalent to selecting the best choice of an ANM.
+
+    If non-root node and the data is discrete:
+    Discrete Additive Noise Models have almost the same definition as non-discrete ANMs, but come with an additional constraint to return discrete values.
+    Note that 'discrete' here refers to numerical values with an order. If the data is categorical, consider representing them as strings to ensure proper model selection.
+
+    If non-root node and the data is categorical:
+    A functional causal model based on a classifier, i.e., X_i = f(PA_i, N_i).
+    Here, N_i follows a uniform distribution on [0, 1] and is used to randomly sample a class (category) using the conditional probability distribution produced by a classification model.Here, different model classes are evaluated using the (negative) F1 score and the best performing model class is selected.
+
+    In total, 3 nodes were analyzed:
+
     --- Node: X
-    Node X is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution.
+    Node X is a root node. Therefore, assigning 'Empirical Distribution' to the node representing the marginal distribution.

     --- Node: Y
-    Node Y is a non-root node. Assigning 'AdditiveNoiseModel using LinearRegression' to the node.
+    Node Y is a non-root node with continuous data. Assigning 'AdditiveNoiseModel using LinearRegression' to the node.
     This represents the causal relationship as Y := f(X) + N.
     For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
-    LinearRegression: 1.0023387259040388
+    LinearRegression: 0.9978767184153945
     Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
-                    ('linearregression', LinearRegression)]): 1.0099017476403862
-    HistGradientBoostingRegressor: 1.1091403766880177
-    Based on the type of causal mechanism, the model with the lowest metric value represents the best choice.
+                    ('linearregression', LinearRegression)]): 1.00448207264867
+    HistGradientBoostingRegressor: 1.1386270868995179

     --- Node: Z
-    Node Z is a non-root node. Assigning 'AdditiveNoiseModel using LinearRegression' to the node.
+    Node Z is a non-root node with continuous data. Assigning 'AdditiveNoiseModel using LinearRegression' to the node.
     This represents the causal relationship as Z := f(Y) + N.
     For the model selection, the following models were evaluated on the mean squared error (MSE) metric:
-    LinearRegression: 0.9451918596711175
+    LinearRegression: 1.0240822102491627
     Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)),
-                    ('linearregression', LinearRegression)]): 0.9488259577453813
-    HistGradientBoostingRegressor: 1.682146254853607
-    Based on the type of causal mechanism, the model with the lowest metric value represents the best choice.
+                    ('linearregression', LinearRegression)]): 1.02567150836141
+    HistGradientBoostingRegressor: 1.358002751994007
+
+    ===Note===
+    Note, based on the selected auto assignment quality, the set of evaluated models changes.
+    For more insights toward the quality of the fitted graphical causal model, consider using the evaluate_causal_model function after fitting the causal mechanisms.

 In this scenario, an empirical distribution is assigned to the root node X, while additive noise models are applied
 to nodes Y and Z. In both of these cases, a linear regression model demonstrated the best performance in terms
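
Reviewer note: the summary text in this hunk says a discrete ANM has almost the same form as a regular ANM but must return discrete values. A minimal, self-contained sketch of that idea follows. It assumes the constraint is enforced by rounding the regression output and reusing integer residuals as noise, which is an illustrative assumption and not necessarily how DiscreteAdditiveNoiseModel is implemented. All function names here are hypothetical.

```python
import random


def collect_integer_residuals(parent_values, node_values, f):
    """Estimate the noise as N = X - round(f(PA)); integers if X is integer-valued."""
    return [x - round(f(pa)) for pa, x in zip(parent_values, node_values)]


def draw_discrete_samples(parent_values, f, residuals, rng):
    """Generate X = round(f(PA)) + N with N resampled from the observed residuals."""
    return [round(f(pa)) + rng.choice(residuals) for pa in parent_values]


def f(pa):
    # Hypothetical fitted regression function with non-integer outputs.
    return 1.5 * pa


parents = [0, 1, 2, 3, 4]
observed = [0, 2, 3, 5, 6]  # integer-valued (discrete) observations

residuals = collect_integer_residuals(parents, observed, f)
samples = draw_discrete_samples(parents, f, residuals, random.Random(0))
```

Because both the rounded prediction and the resampled residual are integers, every generated value stays discrete, whereas an unconstrained ANM with the same `f` could generate non-integer values.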

dowhy/gcm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
     MedianDeviationScorer,
     RescaledMedianCDFQuantileScorer,
 )
-from .causal_mechanisms import AdditiveNoiseModel, ClassifierFCM, PostNonlinearModel
+from .causal_mechanisms import AdditiveNoiseModel, ClassifierFCM, DiscreteAdditiveNoiseModel, PostNonlinearModel
 from .causal_models import InvertibleStructuralCausalModel, ProbabilisticCausalModel, StructuralCausalModel
 from .confidence_intervals import confidence_intervals
 from .confidence_intervals_cms import bootstrap_sampling, fit_and_compute

dowhy/gcm/auto.py

Lines changed: 153 additions & 28 deletions
@@ -14,7 +14,7 @@
 from sklearn.preprocessing import MultiLabelBinarizer

 from dowhy.gcm import config
-from dowhy.gcm.causal_mechanisms import AdditiveNoiseModel, ClassifierFCM
+from dowhy.gcm.causal_mechanisms import AdditiveNoiseModel, ClassifierFCM, DiscreteAdditiveNoiseModel
 from dowhy.gcm.causal_models import CAUSAL_MECHANISM, ProbabilisticCausalModel, validate_causal_model_assignment
 from dowhy.gcm.ml import (
     ClassificationModel,
@@ -48,6 +48,7 @@
     auto_apply_encoders,
     auto_fit_encoders,
     is_categorical,
+    is_discrete,
     set_random_seed,
     shape_into_2d,
 )
@@ -108,7 +109,43 @@ def add_model_performance(self, node, model: str, performance: str, metric_name:
     def __str__(self):
         summary_strings = []

-        summary_strings.append("Analyzed %d nodes." % len(list(self._nodes)))
+        summary_strings.append(
+            "When using this auto assignment function, the given data is used to automatically assign a causal "
+            "mechanism to each node. Note that causal mechanisms can also be customized and assigned manually.\n"
+            "The following types of causal mechanisms are considered for the automatic selection:"
+        )
+        summary_strings.append("\nIf root node:")
+        summary_strings.append(
+            "An empirical distribution, i.e., the distribution is represented by randomly sampling from the provided "
+            "data. This provides a flexible and non-parametric way to model the marginal distribution and is valid for "
+            "all types of data modalities."
+        )
+        summary_strings.append("\nIf non-root node and the data is continuous:")
+        summary_strings.append(
+            "Additive Noise Models (ANM) of the form X_i = f(PA_i) + N_i, where PA_i are the "
+            "parents of X_i and the unobserved noise N_i is assumed to be independent of PA_i."
+            "To select the best model for f, different regression models are evaluated and the model "
+            "with the smallest mean squared error is selected."
+            "Note that minimizing the mean squared error here is equivalent to selecting the best "
+            "choice of an ANM."
+        )
+        summary_strings.append("\nIf non-root node and the data is discrete:")
+        summary_strings.append(
+            "Discrete Additive Noise Models have almost the same definition as non-discrete ANMs, but come with an "
+            "additional constraint to return discrete values.\n"
+            "Note that 'discrete' here refers to numerical values with an order. If the data is categorical, consider "
+            "representing them as strings to ensure proper model selection."
+        )
+        summary_strings.append("\nIf non-root node and the data is categorical:")
+        summary_strings.append(
+            "A functional causal model based on a classifier, i.e., X_i = f(PA_i, N_i).\n"
+            "Here, N_i follows a uniform distribution on [0, 1] and is used to randomly sample a "
+            "class (category) using the conditional probability distribution produced by a "
+            "classification model."
+            "Here, different model classes are evaluated using the (negative) F1 score and the best"
+            " performing model class is selected."
+        )
+        summary_strings.append("\nIn total, %d nodes were analyzed:" % len(list(self._nodes)))

         for node in self._nodes:
             summary_strings.append("\n--- Node: %s" % node)
@@ -123,11 +160,13 @@ def __str__(self):
                 for (model, performance, metric_name) in self._nodes[node]["model_performances"]:
                     summary_strings.append("%s: %s" % (str(model()).replace("()", ""), str(performance)))

-                summary_strings.append(
-                    "Based on the type of causal mechanism, the model with the lowest metric value "
-                    "represents the best choice."
-                )
-
+        summary_strings.append(
+            "\n===Note===\nNote, based on the selected auto assignment quality, the set of " "evaluated models changes."
+        )
+        summary_strings.append(
+            "For more insights toward the quality of the fitted graphical causal model, consider "
+            "using the evaluate_causal_model function after fitting the causal mechanisms."
+        )
         return "\n".join(summary_strings)


@@ -137,26 +176,86 @@ def assign_causal_mechanisms(
     quality: AssignmentQuality = AssignmentQuality.GOOD,
     override_models: bool = False,
 ) -> AutoAssignmentSummary:
-    """Automatically assigns appropriate causal models. If causal models are already assigned to nodes and
-    override_models is set to False, this function only validates the assignments with respect to the graph structure.
-    Here, the validation checks whether root nodes have StochasticModels and non-root ConditionalStochasticModels
-    assigned.
+    """Automatically assigns appropriate causal mechanisms to nodes. If causal mechanisms are already assigned to nodes
+    and override_models is set to False, this function only validates the assignments with respect to the graph
+    structure. This is, the validation checks whether root nodes have StochasticModels and non-root
+    ConditionalStochasticModels assigned.
+
+    The following types of causal mechanisms are considered for the automatic selection:
+
+    If root node:
+    An empirical distribution, i.e., the distribution is represented by randomly sampling from the provided data.
+    This provides a flexible and non-parametric way to model the marginal distribution and is valid for all types of
+    data modalities.
+
+    If non-root node and the data is continuous:
+    Additive Noise Models (ANM) of the form X_i = f(PA_i) + N_i, where PA_i are the parents of X_i and the unobserved
+    noise N_i is assumed to be independent of PA_i. To select the best model for f, different regression models are
+    evaluated and the model with the smallest mean squared error is selected. Note that minimizing the mean squared
+    error here is equivalent to selecting the best choice of an ANM.
+
+    If non-root node and the data is discrete:
+    Discrete Additive Noise Models have almost the same definition as non-discrete ANMs, but come with an additional
+    constraint to return discrete values. Note that 'discrete' here refers to numerical values with an order. If the
+    data is categorical, consider representing them as strings to ensure proper model selection.
+
+    If non-root node and the data is categorical:
+    A functional causal model based on a classifier, i.e., X_i = f(PA_i, N_i).
+    Here, N_i follows a uniform distribution on [0, 1] and is used to randomly sample a class (category) using the
+    conditional probability distribution produced by a classification model. Here, different model classes are evaluated
+    using the (negative) F1 score and the best performing model class is selected.
+
+    The current model zoo is:
+
+    With "GOOD" quality:
+        Numerical:
+        - Linear Regressor
+        - Linear Regressor with polynomial features
+        - Histogram Gradient Boost Regressor
+
+        Categorical:
+        - Logistic Regressor
+        - Logistic Regressor with polynomial features
+        - Histogram Gradient Boost Classifier
+
+    With "BETTER" quality:
+        Numerical:
+        - Linear Regressor
+        - Linear Regressor with polynomial features
+        - Gradient Boost Regressor
+        - Ridge Regressor
+        - Lasso Regressor
+        - Random Forest Regressor
+        - Support Vector Regressor
+        - Extra Trees Regressor
+        - KNN Regressor
+        - Ada Boost Regressor
+
+        Categorical:
+        - Logistic Regressor
+        - Logistic Regressor with polynomial features
+        - Histogram Gradient Boost Classifier
+        - Random Forest Classifier
+        - Extra Trees Classifier
+        - Support Vector Classifier
+        - KNN Classifier
+        - Gaussian Naive Bayes Classifier
+        - Ada Boost Classifier
+
+    With "BEST" quality:
+        An auto ML model based on AutoGluon (optional dependency, needs to be installed).

     :param causal_model: The causal model to whose nodes to assign causal models.
     :param based_on: Jointly sampled data corresponding to the nodes of the given graph.
     :param quality: AssignmentQuality for the automatic model selection and model accuracy. This changes the type of
-        prediction model and time spent on the selection. Options are:
-        - AssignmentQuality.GOOD: Compares a linear, polynomial and gradient boost model on small test-training split
-            of the data. The best performing model is then selected.
+        prediction model and time spent on the selection. See the docstring for a list of potential models.
+        The options for the quality are:
+        - AssignmentQuality.GOOD: Only a small set of models are evaluated.
             Model selection speed: Fast
             Model training speed: Fast
             Model inference speed: Fast
             Model accuracy: Medium
-        - AssignmentQuality.BETTER: Compares multiple model types and uses the one with the best performance
-            averaged over multiple splits of the training data. By default, the model with the smallest root mean
-            squared error is selected for regression problems and the model with the highest F1 score is selected for
-            classification problems. For a list of possible models, see _LIST_OF_POTENTIAL_REGRESSORS_BETTER and
-            _LIST_OF_POTENTIAL_CLASSIFIERS_BETTER, respectively.
+        - AssignmentQuality.BETTER: A larger set of models are evaluated.
             Model selection speed: Medium
             Model training speed: Fast
             Model inference speed: Fast
@@ -168,8 +267,8 @@ def assign_causal_mechanisms(
             Model training speed: Slow
             Model inference speed: Slow-Medium
             Model accuracy: Best
-    :param override_models: If set to True, existing model assignments are replaced with automatically selected
-        ones. If set to False, the assigned models are only validated with respect to the graph
+    :param override_models: If set to True, existing mechanism assignments are replaced with automatically selected
+        ones. If set to False, the assigned mechanisms are only validated with respect to the graph
         structure.
     :return: A summary object containing details about the model selection process.
     """
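
Reviewer note: the docstring describes the categorical case as X_i = f(PA_i, N_i), where N_i is uniform on [0, 1] and selects a class from the classifier's conditional probabilities. A minimal sketch of that sampling step, assuming a plain cumulative-probability lookup, is shown below. The function name and the probability dictionary are hypothetical, and this is not claimed to be the ClassifierFCM code.

```python
from itertools import accumulate


def sample_class(class_probabilities, u):
    """Map a uniform noise value u in [0, 1] to a class via cumulative probabilities."""
    classes = list(class_probabilities)
    cumulative = accumulate(class_probabilities[c] for c in classes)
    for label, threshold in zip(classes, cumulative):
        if u < threshold:
            return label
    # Guard against floating-point round-off when u is very close to 1.
    return classes[-1]


# Hypothetical conditional distribution P(X_i | PA_i = pa) from a classifier.
probabilities = {"low": 0.2, "medium": 0.5, "high": 0.3}

picked_low = sample_class(probabilities, 0.1)
picked_medium = sample_class(probabilities, 0.6)
picked_high = sample_class(probabilities, 0.95)
```

Treating the noise as an explicit input makes the mechanism deterministic given (PA_i, N_i), which is what lets the same noise value be reused when the parents change.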
@@ -179,7 +278,8 @@ def assign_causal_mechanisms(
         if not override_models and CAUSAL_MECHANISM in causal_model.graph.nodes[node]:
             auto_assignment_summary.add_node_log_message(
                 node,
-                "Node %s already has a model assigned and the override parameter is False. Skipping this node." % node,
+                "Node %s already has a causal mechanism assigned and the override parameter is False. Skipping this "
+                "node." % node,
             )
             validate_causal_model_assignment(causal_model.graph, node)
             continue
@@ -189,16 +289,36 @@ def assign_causal_mechanisms(
         if is_root_node(causal_model.graph, node):
             auto_assignment_summary.add_node_log_message(
                 node,
-                "Node %s is a root node. Assigning '%s' to the node representing the marginal distribution."
+                "Node %s is a root node. Therefore, assigning '%s' to the node representing the marginal distribution."
                 % (node, causal_model.causal_mechanism(node)),
             )
         else:
+            data_type = "continuous"
+            if isinstance(causal_model.causal_mechanism(node), ClassifierFCM):
+                data_type = "categorical"
+            elif isinstance(causal_model.causal_mechanism(node), DiscreteAdditiveNoiseModel):
+                data_type = "discrete"
+
             auto_assignment_summary.add_node_log_message(
                 node,
-                "Node %s is a non-root node. Assigning '%s' to the node." % (node, causal_model.causal_mechanism(node)),
+                "Node %s is a non-root node with %s data. Assigning '%s' to the node."
+                % (
+                    node,
+                    data_type,
+                    causal_model.causal_mechanism(node),
+                ),
             )

-            if isinstance(causal_model.causal_mechanism(node), AdditiveNoiseModel):
+            if isinstance(causal_model.causal_mechanism(node), DiscreteAdditiveNoiseModel):
+                auto_assignment_summary.add_node_log_message(
+                    node,
+                    "This represents the discrete causal relationship as "
+                    + str(node)
+                    + " := f("
+                    + ",".join([str(parent) for parent in get_ordered_predecessors(causal_model.graph, node)])
+                    + ") + N.",
+                )
+            elif isinstance(causal_model.causal_mechanism(node), AdditiveNoiseModel):
                 auto_assignment_summary.add_node_log_message(
                     node,
                     "This represents the causal relationship as "
@@ -230,16 +350,21 @@ def assign_causal_mechanism_node(
         causal_model.set_causal_mechanism(node, EmpiricalDistribution())
         model_performances = []
     else:
+        node_data = based_on[node].to_numpy()
+
         best_model, model_performances = select_model(
             based_on[get_ordered_predecessors(causal_model.graph, node)].to_numpy(),
-            based_on[node].to_numpy(),
+            node_data,
             quality,
         )

         if isinstance(best_model, ClassificationModel):
             causal_model.set_causal_mechanism(node, ClassifierFCM(best_model))
         else:
-            causal_model.set_causal_mechanism(node, AdditiveNoiseModel(best_model))
+            if is_discrete(node_data):
+                causal_model.set_causal_mechanism(node, DiscreteAdditiveNoiseModel(best_model))
+            else:
+                causal_model.set_causal_mechanism(node, AdditiveNoiseModel(best_model))

     return model_performances

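
Reviewer note: after this hunk, the mechanism type depends on three questions: is the node a root, did model selection return a classifier, and is the node data discrete. The sketch below restates that branching in isolation. The type checks are simplified stand-ins (strings mean categorical, integer-valued numbers mean discrete) chosen only for illustration; they are assumptions and do not reproduce the library's `is_categorical` or `is_discrete` helpers.

```python
def choose_mechanism_type(is_root, values):
    """Return the name of the mechanism type that the branching above would pick."""
    if is_root:
        return "EmpiricalDistribution"
    # Assumption: string-valued data is treated as categorical.
    if any(isinstance(v, str) for v in values):
        return "ClassifierFCM"
    # Assumption: numeric data without fractional parts is treated as discrete.
    if all(float(v).is_integer() for v in values):
        return "DiscreteAdditiveNoiseModel"
    return "AdditiveNoiseModel"


root_choice = choose_mechanism_type(True, [0.3, 1.7])
categorical_choice = choose_mechanism_type(False, ["red", "blue"])
discrete_choice = choose_mechanism_type(False, [1, 2, 3.0])
continuous_choice = choose_mechanism_type(False, [0.5, 1])
```

This also shows why the summary text recommends string labels for categorical data: numerically encoded categories would fall into the discrete branch.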

@@ -263,7 +388,7 @@ def select_model(
     elif model_selection_quality == AssignmentQuality.GOOD:
         list_of_regressor = list(_LIST_OF_POTENTIAL_REGRESSORS_GOOD)
         list_of_classifier = list(_LIST_OF_POTENTIAL_CLASSIFIERS_GOOD)
-        model_selection_splits = 2
+        model_selection_splits = 5
     elif model_selection_quality == AssignmentQuality.BETTER:
         list_of_regressor = list(_LIST_OF_POTENTIAL_REGRESSORS_BETTER)
         list_of_classifier = list(_LIST_OF_POTENTIAL_CLASSIFIERS_BETTER)
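
Reviewer note: this hunk raises `model_selection_splits` for the GOOD quality from 2 to 5. How the splits are used is not visible in this diff, so the following is only a sketch of the general idea, assuming plain k-fold cross-validation with the mean squared error as the score. The candidate models and all function names are hypothetical.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate: one-dimensional least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def cross_validated_mse(fit, xs, ys, num_splits):
    """Average the held-out MSE over num_splits interleaved folds."""
    fold_errors = []
    for k in range(num_splits):
        test_idx = set(range(k, len(xs), num_splits))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        predict = fit(train_x, train_y)
        fold_errors.append(mean((ys[i] - predict(xs[i])) ** 2 for i in test_idx))
    return mean(fold_errors)


def select_best(candidates, xs, ys, num_splits=5):
    """Return the candidate name with the smallest cross-validated MSE, plus all scores."""
    scores = {name: cross_validated_mse(fit, xs, ys, num_splits) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


xs = list(range(20))
ys = [2 * x + 1 for x in xs]
best_name, scores = select_best({"mean": fit_mean, "line": fit_line}, xs, ys, num_splits=5)
```

With more splits, each score averages over more held-out folds, which tends to make the comparison between candidates less sensitive to a single unlucky split, at the cost of more fits per candidate.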
