Commit 1009ddc

Extend GCM model evaluation by additional metrics
In addition to CRPS and depending on the node data type, it now also reports the MSE, NMSE, R2 and F1 score.

Signed-off-by: Patrick Bloebaum <bloebp@amazon.com>
1 parent bd4f95f commit 1009ddc

3 files changed

+222
-106
lines changed


docs/source/user_guide/modeling_gcm/model_evaluation.rst

Lines changed: 22 additions & 8 deletions
@@ -109,20 +109,31 @@ performance and whether our assumptions hold:
109109
110110
==== Evaluation of Causal Mechanisms ====
111111
Root nodes are evaluated based on the KL divergence between the generated and the observed distribution.
112-
Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances.
112+
Non-root nodes are mainly evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. In addition, the mean squared error (MSE), the normalized MSE (NMSE), the R2 coefficient and the F1 score (for categorical nodes) are reported.
113113
114-
--- Node X: The KL divergence between generated and observed distribution is 0.020548269898818708.
114+
--- Node X
115+
- The KL divergence between generated and observed distribution is 0.009626590006593095.
115116
The estimated KL divergence indicates an overall very good representation of the data distribution.
116117
117-
--- Node Y: The normalized CRPS of this node is 0.26169914525652427.
118+
--- Node Y
119+
- The MSE is 0.9757997114620423.
120+
- The NMSE is 0.43990166981441525.
121+
- The R2 coefficient is 0.8061235344428738.
122+
- The normalized CRPS is 0.25017606839653783.
118123
The estimated CRPS indicates a good model performance.
124+
The mechanism is better or equally good than all 7 baseline mechanisms.
119125
120-
--- Node Z: The normalized CRPS of this node is 0.08497732548860475.
126+
--- Node Z
127+
- The MSE is 1.0203244742317465.
128+
- The NMSE is 0.14823906495213202.
129+
- The R2 coefficient is 0.9779316094447573.
130+
- The normalized CRPS is 0.08426403180533645.
121131
The estimated CRPS indicates a very good model performance.
132+
The mechanism is better or equally good than all 7 baseline mechanisms.
122133
123134
==== Evaluation of Invertible Functional Causal Model Assumption ====
124135
125-
--- The model assumption for node Y is not rejected with a p-value of 0.9261751353508025 (after potential adjustment) and a significance level of 0.05.
136+
--- The model assumption for node Y is not rejected with a p-value of 1.0 (after potential adjustment) and a significance level of 0.05.
126137
This implies that the model assumption might be valid.
127138
128139
--- The model assumption for node Z is not rejected with a p-value of 1.0 (after potential adjustment) and a significance level of 0.05.
@@ -131,7 +142,7 @@ performance and whether our assumptions hold:
131142
Note that these results are based on statistical independence tests, and the fact that the assumption was not rejected does not necessarily imply that it is correct. There is just no evidence against it.
132143
133144
==== Evaluation of Generated Distribution ====
134-
The overall average KL divergence between the generated and observed distribution is 0.04045436327952057
145+
The overall average KL divergence between the generated and observed distribution is 0.0017936403551594468
135146
The estimated KL divergence indicates an overall very good representation of the data distribution.
136147
137148
==== Evaluation of the Causal Graph Structure ====
@@ -156,8 +167,11 @@ performance and whether our assumptions hold:
156167
As we see, we get a detailed overview of different evaluations:
157168

158169
**Evaluation of Causal Mechanisms:** Evaluation of the causal mechanisms with respect to their model performance.
159-
The performance of non-root nodes is measured using the Continuous Ranked Probability Score (CRPS), and the performance
160-
of root nodes is measured using the KL divergence between the generated and observed data distributions.
170+
For non-root nodes, the most important measure is the Continuous Ranked Probability Score (CRPS), which provides
171+
insights into the mechanism's accuracy and its calibration as a probabilistic model. The evaluation further lists other metrics
172+
such as the mean squared error (MSE), the MSE normalized by the variance (denoted as NMSE), the R2 coefficient and, in
173+
the case of categorical variables, the F1 score.
174+
If the node is a root node, the KL divergence between the generated and observed data distributions is measured.
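
To make the metrics named above concrete, here is a minimal sketch in plain Python. It is not the library's implementation, and all function names are hypothetical. It assumes that NMSE means the MSE divided by the population variance of the targets (following the wording "MSE normalized by the variance"; the library's exact normalization may differ), shows the F1 score for a single positive class only, and uses the standard sample-based CRPS estimator without any normalization.

```python
from statistics import fmean, pvariance


def mse(y_true, y_pred):
    """Mean squared error between targets and point predictions."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def nmse(y_true, y_pred):
    """MSE divided by the population variance of the targets.

    Assumption: this mirrors the phrase "MSE normalized by the variance";
    the actual library may normalize differently.
    """
    return mse(y_true, y_pred) / pvariance(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def f1_single_class(y_true, y_pred, positive):
    """F1 score for one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    `samples` are draws from the predicted conditional distribution and
    `y` is the observed value. For a single sample this reduces to the
    absolute error.
    """
    spread_to_obs = fmean(abs(x - y) for x in samples)
    spread_within = fmean(abs(a - b) for a in samples for b in samples)
    return spread_to_obs - 0.5 * spread_within


if __name__ == "__main__":
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.5, 2.0, 2.5, 4.0]
    print("MSE: ", mse(y_true, y_pred))
    print("NMSE:", nmse(y_true, y_pred))
    print("R2:  ", r2(y_true, y_pred))
    print("CRPS:", crps_from_samples([0.0, 2.0], 1.0))
```

Under the variance-based definition assumed here, NMSE and R2 are related by NMSE = 1 - R2 on the same data, so the two carry the same information on different scales.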
161175

162176
Optionally, we can set the `compare_mechanism_baselines` parameter to `True` in order
163177
to compare the mechanisms with some baseline models. This gives us better insights into how the mechanisms perform in
