From 1009ddc04b8c6aeef98284bf3829def3d1b79fc0 Mon Sep 17 00:00:00 2001
From: Patrick Bloebaum
Date: Tue, 21 Nov 2023 11:55:59 -0800
Subject: [PATCH] Extend GCM model evaluation by additional metrics

In addition to CRPS and depending on the node data type, it now also reports the MSE, NMSE, R2 and F1 score.

Signed-off-by: Patrick Bloebaum
---
 .../modeling_gcm/model_evaluation.rst         |  30 ++-
 dowhy/gcm/model_evaluation.py                 | 191 +++++++++++-------
 tests/gcm/test_model_evaluation.py            | 107 +++++++---
 3 files changed, 222 insertions(+), 106 deletions(-)

diff --git a/docs/source/user_guide/modeling_gcm/model_evaluation.rst b/docs/source/user_guide/modeling_gcm/model_evaluation.rst
index 80101a9a47..c7259abbe8 100644
--- a/docs/source/user_guide/modeling_gcm/model_evaluation.rst
+++ b/docs/source/user_guide/modeling_gcm/model_evaluation.rst
@@ -109,20 +109,31 @@ performance and whether our assumptions hold:
 
     ==== Evaluation of Causal Mechanisms ====
     Root nodes are evaluated based on the KL divergence between the generated and the observed distribution.
-    Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances.
+    Non-root nodes are mainly evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. In addition, the mean squared error (MSE), the normalized MSE (NMSE), the R2 coefficient and the F1 score (for categorical nodes) are reported.
 
-    --- Node X: The KL divergence between generated and observed distribution is 0.020548269898818708.
+    --- Node X
+    - The KL divergence between generated and observed distribution is 0.009626590006593095.
     The estimated KL divergence indicates an overall very good representation of the data distribution.
 
-    --- Node Y: The normalized CRPS of this node is 0.26169914525652427.
+    --- Node Y
+    - The MSE is 0.9757997114620423.
+    - The NMSE is 0.43990166981441525.
+    - The R2 coefficient is 0.8061235344428738.
+    - The normalized CRPS is 0.25017606839653783.
     The estimated CRPS indicates a good model performance.
+    The mechanism is better or equally good than all 7 baseline mechanisms.
 
-    --- Node Z: The normalized CRPS of this node is 0.08497732548860475.
+    --- Node Z
+    - The MSE is 1.0203244742317465.
+    - The NMSE is 0.14823906495213202.
+    - The R2 coefficient is 0.9779316094447573.
+    - The normalized CRPS is 0.08426403180533645.
     The estimated CRPS indicates a very good model performance.
+    The mechanism is better or equally good than all 7 baseline mechanisms.
 
     ==== Evaluation of Invertible Functional Causal Model Assumption ====
 
-    --- The model assumption for node Y is not rejected with a p-value of 0.9261751353508025 (after potential adjustment) and a significance level of 0.05.
+    --- The model assumption for node Y is not rejected with a p-value of 1.0 (after potential adjustment) and a significance level of 0.05.
     This implies that the model assumption might be valid.
 
     --- The model assumption for node Z is not rejected with a p-value of 1.0 (after potential adjustment) and a significance level of 0.05.
@@ -131,7 +142,7 @@ performance and whether our assumptions hold:
     Note that these results are based on statistical independence tests, and the fact that the assumption was not rejected does not necessarily imply that it is correct. There is just no evidence against it.
 
     ==== Evaluation of Generated Distribution ====
-    The overall average KL divergence between the generated and observed distribution is 0.04045436327952057
+    The overall average KL divergence between the generated and observed distribution is 0.0017936403551594468
     The estimated KL divergence indicates an overall very good representation of the data distribution.
 
     ==== Evaluation of the Causal Graph Structure ====
@@ -156,8 +167,11 @@ performance and whether our assumptions hold:
 As we see, we get a detailed overview of different evaluations:
 
 **Evaluation of Causal Mechanisms:** Evaluation of the causal mechanisms with respect to their model performance.
-The performance of non-root nodes is measured using the Continuous Ranked Probability Score (CRPS), and the performance
-of root nodes is measured using the KL divergence between the generated and observed data distributions.
+For non-root nodes, the most important measure is the Continuous Ranked Probability Score (CRPS), which provides
+insights into the mechanism's accuracy and its calibration as a probabilistic model. It further lists other metrics
+such as the mean squared error (MSE), the MSE normalized by the variance (denoted as NMSE), the R2 coefficient and, in
+the case of categorical variables, the F1 score.
+If the node is a root node, the KL divergence between the generated and observed data distributions is measured.
 Optionally, we can set the `compare_mechanism_baselines` parameter to `True` in order to compare the mechanisms with
 some baseline models.
 This gives us better insights into how the mechanisms perform in
diff --git a/dowhy/gcm/model_evaluation.py b/dowhy/gcm/model_evaluation.py
index 857251cff4..604372f53d 100644
--- a/dowhy/gcm/model_evaluation.py
+++ b/dowhy/gcm/model_evaluation.py
@@ -8,7 +8,7 @@
 from joblib import Parallel, delayed
 from numpy.matlib import repmat
 from scipy.stats import mode
-from sklearn.metrics import f1_score, mean_squared_error
+from sklearn.metrics import f1_score, mean_squared_error, r2_score
 from sklearn.model_selection import KFold
 from statsmodels.stats.multitest import multipletests
 from tqdm import tqdm
@@ -135,11 +135,40 @@ def __init__(
         self.n_jobs = n_jobs
 
 
+@dataclass
+class MechanismPerformanceResult:
+    def __init__(
+        self,
+        node_name: Any,
+        is_root: bool,
+        crps: Optional[float],
+        kl_divergence: Optional[float],
+        mse: Optional[float],
+        nmse: Optional[float],
+        r2: Optional[float],
+        f1: Optional[float],
+        count_better_performance: Optional[int],
+        best_baseline_model: Optional[str],
+        total_number_baselines: int,
+        best_baseline_performance: Optional[float],
+    ):
+        self.node_name = node_name
+        self.is_root = is_root
+        self.crps = crps
+        self.kl_divergence = kl_divergence
+        self.mse = mse
+        self.nmse = nmse
+        self.r2 = r2
+        self.f1 = f1
+        self.count_better_performance = count_better_performance
+        self.best_baseline_model = best_baseline_model
+        self.total_number_baselines = total_number_baselines
+        self.best_baseline_performance = best_baseline_performance
+
+
 @dataclass
 class CausalModelEvaluationResult:
-    model_performances: Optional[
-        Dict[Any, Tuple[bool, float, Optional[int], Optional[str], int, Optional[float]]]
-    ] = None
+    mechanism_performances: Optional[Dict[str, MechanismPerformanceResult]] = None
     pnl_assumptions: Optional[Dict[Any, Tuple[float, str, Optional[float]]]] = None
     graph_falsification: Optional[EvaluationResult] = None
     overall_kl_divergence: Optional[float] = None
@@ -147,7 +176,7 @@ class CausalModelEvaluationResult:
     def __str__(self):
         summary_string = "Evaluated"
 
-        if self.model_performances is not None:
+        if self.mechanism_performances is not None:
             summary_string += " the performance of the causal mechanisms"
         if self.pnl_assumptions is not None:
             summary_string += " and the invertibility assumption of the causal mechanisms"
@@ -159,43 +188,57 @@ def __str__(self):
         summary_string += ". The results are as follows:"
 
         summary_strings = [summary_string]
 
-        if self.model_performances is not None:
+        if self.mechanism_performances is not None:
             summary_strings.append("\n==== Evaluation of Causal Mechanisms ====")
             summary_strings.append(
                 "Root nodes are evaluated based on the KL divergence between the generated "
                 "and the observed distribution."
             )
             summary_strings.append(
-                "Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score "
-                "(CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic "
+                "Non-root nodes are mainly evaluated based on the (normalized) Continuous Ranked Probability Score "
+                "(CRPS), which generalizes the Mean Absolute Percentage Error to probabilistic "
                 "predictions. Since the causal mechanisms produce conditional distributions, this "
-                "should give some insights into their performance and calibration. However, note that many algorithms "
-                "are still relatively robust against poor model performances."
+                "should give some insights into their performance and calibration. In addition, the mean squared error "
+                "(MSE), the normalized MSE (NMSE), the R2 coefficient and the F1 score (for categorical nodes) are "
+                "reported."
             )
 
-            for node in self.model_performances:
-                if self.model_performances[node][0]:
+            for mechanism_performance in self.mechanism_performances.values():
+                summary_strings.append("\n--- Node %s" % mechanism_performance.node_name)
+                if mechanism_performance.kl_divergence is not None:
                     summary_strings.append(
-                        "\n--- Node %s: The KL divergence between generated and observed distribution is %s."
-                        % (node, self.model_performances[node][1])
+                        "- The KL divergence between generated and observed distribution is %s."
+                        % mechanism_performance.kl_divergence
                     )
-                    summary_strings.append(_get_kl_divergence_interpretation_string(self.model_performances[node][1]))
-                else:
                     summary_strings.append(
-                        "\n--- Node %s: The normalized CRPS of this node is %s."
-                        % (node, self.model_performances[node][1])
+                        _get_kl_divergence_interpretation_string(mechanism_performance.kl_divergence)
                     )
-                    summary_strings.append(_get_crps_interpretation_string(self.model_performances[node][1]))
-                    if self.model_performances[node][2] is not None:
-                        summary_strings.append(
-                            _get_baseline_model_interpretation_string(
-                                self.model_performances[node][2],
-                                self.model_performances[node][4],
-                                self.model_performances[node][3],
-                                self.model_performances[node][5],
-                            )
+                if mechanism_performance.mse is not None:
+                    summary_strings.append("- The MSE is %s." % mechanism_performance.mse)
+
+                if mechanism_performance.nmse is not None:
+                    summary_strings.append("- The NMSE is %s." % mechanism_performance.nmse)
+
+                if mechanism_performance.r2 is not None:
+                    summary_strings.append("- The R2 coefficient is %s." % mechanism_performance.r2)
+
+                if mechanism_performance.f1 is not None:
+                    summary_strings.append("- The F1 score is %s." % mechanism_performance.f1)
+
+                if mechanism_performance.crps is not None:
+                    summary_strings.append("- The normalized CRPS is %s." % mechanism_performance.crps)
+                    summary_strings.append(_get_crps_interpretation_string(mechanism_performance.crps))
+
+                if mechanism_performance.total_number_baselines > 0:
+                    summary_strings.append(
+                        _get_baseline_model_interpretation_string(
+                            mechanism_performance.count_better_performance,
+                            mechanism_performance.total_number_baselines,
+                            mechanism_performance.best_baseline_model,
+                            mechanism_performance.best_baseline_performance,
                         )
+                    )
 
         if self.pnl_assumptions is not None:
             summary_strings.append("\n==== Evaluation of Invertible Functional Causal Model Assumption ====")
@@ -268,9 +311,11 @@ def evaluate_causal_model(
     Evaluation of Causal Mechanisms:
     The quality of the causal mechanisms is assessed using k-fold cross validation. This means that the models are
     trained from scratch multiple times, which might take a significant amount of time for larger models. Within each fold, the models
-    are assessed using the (normalized) continuous ranked probability score (CRPS). The normalization is with respect to
-    the standard deviation of the target values. Optionally, the mechanisms are compared with baseline models to see if
-    they are performing significantly better or equally good.
+    are assessed by different metrics. For all models, the continuous ranked probability score (CRPS) normalized by the
+    standard deviation is estimated, an important metric that provides insights into the model performance as well as its
+    calibration. Further, if the node is numerical, the mean squared error (MSE), the normalized MSE (normalized by
+    the variance) and the R2 coefficient are computed. In the case of categorical nodes, the F1 score is computed instead.
+    Optionally, the mechanisms' CRPS is compared with baseline models to see if there are baseline models that perform significantly better.
 
     Evaluation of Invertible Functional Causal Model Assumption:
     Invertible causal mechanisms rely on the assumption that the inputs are independent of the reconstructed noise.
@@ -317,7 +362,7 @@ def evaluate_causal_model(
     data = data[np.random.choice(data.shape[0], data.shape[0], replace=False)]
 
     if evaluate_causal_mechanisms:
-        evaluation_result.model_performances = _evaluate_model_performances(
+        evaluation_result.mechanism_performances = _evaluate_model_performances(
             causal_model,
             data,
             compare_mechanism_baselines,
@@ -377,7 +422,7 @@ def evaluate_node(node_name, random_seed):
         set_random_seed(random_seed)
 
         node_data = data[node_name].to_numpy()
-        metric_evaluations = []
+        metric_evaluations = {"CRPS": [], "KL": [], "MSE": [], "NMSE": [], "R2": [], "F1": []}
         baseline_crps = {}
 
         categorical = is_categorical(node_data)
@@ -386,7 +431,7 @@ def evaluate_node(node_name, random_seed):
             tmp_causal_mechanism = causal_model.causal_mechanism(node_name).clone()
             tmp_causal_mechanism.fit(node_data[training_indices])
 
-            metric_evaluations.append(
+            metric_evaluations["KL"].append(
                 auto_estimate_kl_divergence(
                     tmp_causal_mechanism.draw_samples(len(test_indices)), node_data[test_indices]
                 )
@@ -398,10 +443,24 @@ def evaluate_node(node_name, random_seed):
                 tmp_causal_mechanism = causal_model.causal_mechanism(node_name).clone()
                 tmp_causal_mechanism.fit(parent_data[training_indices], node_data[training_indices])
 
-                metric_evaluations.append(
+                metric_evaluations["CRPS"].append(
                     crps(parent_data[test_indices], node_data[test_indices], tmp_causal_mechanism.draw_samples)
                 )
 
+                conditional_expectations = _estimate_conditional_expectations(
+                    tmp_causal_mechanism, parent_data[test_indices], categorical, 50
+                )
+                if categorical:
+                    metric_evaluations["F1"].append(
+                        f1_score(node_data[test_indices], conditional_expectations, average="macro", zero_division=0)
+                    )
+                else:
+                    metric_evaluations["MSE"].append(
+                        mean_squared_error(node_data[test_indices], conditional_expectations)
+                    )
+                    metric_evaluations["NMSE"].append(nmse(node_data[test_indices], conditional_expectations))
+                    metric_evaluations["R2"].append(r2_score(node_data[test_indices], conditional_expectations))
+
                 if not compare_mechanism_baselines:
                     continue
 
@@ -442,22 +501,24 @@ def evaluate_node(node_name, random_seed):
                     crps(parent_data[test_indices], node_data[test_indices], baseline_mechanism.draw_samples)
                 )
 
-        if len(metric_evaluations) == 0:
-            mean_metric = None
-        else:
-            mean_metric = float(np.mean(metric_evaluations))
+        for metric in metric_evaluations:
+            metric_evaluations[metric] = (
+                float(np.mean(metric_evaluations[metric])) if len(metric_evaluations[metric]) > 0 else None
+            )
 
         count_better_performance = None
         best_baseline_performance = None
        best_baseline_model = None
+        total_number_baselines = 0
 
         if compare_mechanism_baselines:
            count_better_performance = 0
 
            for k in baseline_crps:
+                total_number_baselines += 1
                baseline_crps[k] = float(np.mean(baseline_crps[k]))
 
-                if mean_metric - baseline_crps[k] > 0.05:
+                if metric_evaluations["CRPS"] - baseline_crps[k] > 0.05:
                    count_better_performance += 1
 
                if best_baseline_performance is None:
@@ -468,14 +529,19 @@ def evaluate_node(node_name, random_seed):
                     best_baseline_model = k
                     best_baseline_performance = baseline_crps[k]
 
-        return (
-            node_name,
-            is_root_node(causal_model.graph, node_name),
-            mean_metric,
-            count_better_performance,
-            best_baseline_model,
-            len(baseline_crps),
-            best_baseline_performance,
+        return MechanismPerformanceResult(
+            node_name=node_name,
+            is_root=is_root_node(causal_model.graph, node_name),
+            kl_divergence=metric_evaluations["KL"],
+            crps=metric_evaluations["CRPS"],
+            mse=metric_evaluations["MSE"],
+            nmse=metric_evaluations["NMSE"],
+            r2=metric_evaluations["R2"],
+            f1=metric_evaluations["F1"],
+            count_better_performance=count_better_performance,
+            best_baseline_model=best_baseline_model,
+            total_number_baselines=total_number_baselines,
+            best_baseline_performance=best_baseline_performance,
         )
 
     random_seeds = np.random.randint(np.iinfo(np.int32).max, size=len(causal_model.graph.nodes))
@@ -492,25 +558,7 @@ def evaluate_node(node_name, random_seed):
         )
     )
 
-    for (
-        node_name,
-        root_node,
-        mean_metric,
-        count_better_performance,
-        best_baseline_model,
-        num_baselines,
-        best_baseline_performance,
-    ) in all_results:
-        model_performances[node_name] = (
-            root_node,
-            mean_metric,
-            count_better_performance,
-            best_baseline_model,
-            num_baselines,
-            best_baseline_performance,
-        )
-
-    return model_performances
+    return {performance_result.node_name: performance_result for performance_result in all_results}
 
 
 def _evaluate_invertibility_assumptions(
@@ -610,18 +658,23 @@ def _estimate_conditional_expectations(
         return np.array(modes[0].tolist())
 
 
-def nrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
-    """Estimates the Normalized Root Mean Squared Error (NRMSE) based on the given samples. This is, the root mean
+def nmse(y_true: np.ndarray, y_pred: np.ndarray, squared: bool = False) -> float:
+    """Estimates the Normalized Mean Squared Error (NMSE) based on the given samples. That is, the mean
     squared error normalized by the variance of the observed values.
 
     :param y_true: Observed values.
     :param y_pred: Predicted values.
-    :return: The normalized RMSE.
+    :param squared: If True, returns the normalized MSE; if False, returns the normalized RMSE.
+    :return: The normalized MSE if squared is True, otherwise the normalized RMSE.
""" y_true = y_true.reshape(-1) y_pred = y_pred.reshape(-1) - return mean_squared_error(y_true, y_pred, squared=False) / np.std(y_true) + y_std = np.std(y_true) + if y_std == 0: + return mean_squared_error(y_true, y_pred, squared=squared) + + return mean_squared_error(y_true, y_pred, squared=squared) / (np.var(y_true) if squared else y_std) def crps( diff --git a/tests/gcm/test_model_evaluation.py b/tests/gcm/test_model_evaluation.py index 2e7b56bbba..6ce9d00a47 100644 --- a/tests/gcm/test_model_evaluation.py +++ b/tests/gcm/test_model_evaluation.py @@ -27,7 +27,7 @@ _evaluate_invertibility_assumptions, crps, evaluate_causal_model, - nrmse, + nmse, ) @@ -40,7 +40,8 @@ def test_given_good_fit_when_estimate_nrmse_then_returns_zero(): noise_model=ScipyDistribution(stats.norm, loc=0, scale=0), ) - assert nrmse(_estimate_conditional_expectations(mdl, X, False, 1), Y) == approx(0, abs=0.01) + assert nmse(Y, _estimate_conditional_expectations(mdl, X, False, 1), squared=True) == approx(0, abs=0.01) + assert nmse(Y, _estimate_conditional_expectations(mdl, X, False, 1), squared=False) == approx(0, abs=0.01) def test_given_bad_fit_when_estimate_nrmse_then_returns_high_value(): @@ -52,7 +53,8 @@ def test_given_bad_fit_when_estimate_nrmse_then_returns_high_value(): noise_model=ScipyDistribution(stats.norm, loc=0, scale=0), ) - assert nrmse(Y, _estimate_conditional_expectations(mdl, X, False, 1)) > 1 + assert nmse(Y, _estimate_conditional_expectations(mdl, X, False, 1), squared=True) > 1 + assert nmse(Y, _estimate_conditional_expectations(mdl, X, False, 1), squared=False) > 1 def test_given_good_fit_but_noisy_data_when_estimate_nrmse_then_returns_expected_result(): @@ -64,8 +66,13 @@ def test_given_good_fit_but_noisy_data_when_estimate_nrmse_then_returns_expected noise_model=ScipyDistribution(stats.norm, loc=0, scale=1), ) - # The RMSE should be 1 due to the variance of the noise. The NRMSE is accordingly 1 / std(Y). - assert nrmse(Y, _estimate_conditional_expectations(mdl, X, False, 1)) == approx(1 / np.std(Y), abs=0.05) + # The MSE should be 1 due to the variance of the noise. The RMSE is accordingly 1 / var(Y). 
+    assert nmse(Y, _estimate_conditional_expectations(mdl, X, False, 1), squared=True) == approx(
+        1 / np.var(Y), abs=0.05
+    )
+    assert nmse(Y, _estimate_conditional_expectations(mdl, X, False, 1), squared=False) == approx(
+        1 / np.std(Y), abs=0.05
+    )
 
 
 def test_given_good_fit_with_deterministic_data_when_estimate_crps_then_returns_zero():
@@ -195,14 +202,35 @@ def test_given_continuous_data_only_when_evaluate_model_returns_expected_informa
     )
 
     assert summary.overall_kl_divergence == approx(0, abs=0.05)
-    assert summary.model_performances["X0"][1] == approx(0, abs=0.15)
-    assert summary.model_performances["X1"][1] == approx(0, abs=0.15)
-    assert summary.model_performances["Y"][1] == approx(0.05, abs=0.02)  # CRPS
+
+    assert summary.mechanism_performances["X0"].kl_divergence == approx(0, abs=0.2)
+    assert summary.mechanism_performances["X0"].crps == None
+    assert summary.mechanism_performances["X0"].nmse == None
+    assert summary.mechanism_performances["X0"].r2 == None
+    assert summary.mechanism_performances["X0"].f1 == None
+    assert summary.mechanism_performances["X0"].total_number_baselines == 0
+
+    assert summary.mechanism_performances["X1"].kl_divergence == approx(0, abs=0.2)
+    assert summary.mechanism_performances["X1"].crps == None
+    assert summary.mechanism_performances["X1"].nmse == None
+    assert summary.mechanism_performances["X1"].r2 == None
+    assert summary.mechanism_performances["X1"].f1 == None
+    assert summary.mechanism_performances["X1"].total_number_baselines == 0
+
+    assert summary.mechanism_performances["Y"].kl_divergence == None
+    assert summary.mechanism_performances["Y"].crps == approx(0.05, abs=0.02)
+    assert summary.mechanism_performances["Y"].nmse == approx(0.07, abs=0.03)
+    assert summary.mechanism_performances["Y"].r2 == approx(1, abs=0.05)
+    assert summary.mechanism_performances["Y"].f1 == None
+    assert 0 < summary.mechanism_performances["Y"].total_number_baselines <= 2
+    assert summary.mechanism_performances["Y"].count_better_performance == 0
+
+    assert "X0" not in summary.pnl_assumptions
+    assert "X1" not in summary.pnl_assumptions
     assert not summary.pnl_assumptions["Y"][1]
     assert summary.pnl_assumptions["Y"][2] == 0.05
 
     summary.plot_falsification_histogram = False
-
     summary_string = str(summary)
 
     assert (
@@ -210,12 +238,15 @@ def test_given_continuous_data_only_when_evaluate_model_returns_expected_informa
 
 ==== Evaluation of Causal Mechanisms ====
 Root nodes are evaluated based on the KL divergence between the generated and the observed distribution.
-Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances.
-
---- Node X0: The KL divergence between generated and observed distribution is """
+Non-root nodes are mainly evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. In addition, the mean squared error (MSE), the normalized MSE (NMSE), the R2 coefficient and the F1 score (for categorical nodes) are reported."""
         in summary_string
     )
-    assert "--- Node Y: The normalized CRPS of this node is " in summary_string
+    assert "--- Node X0\n" "- The KL divergence between generated and observed distribution is " in summary_string
+    assert "--- Node X1\n" "- The KL divergence between generated and observed distribution is " in summary_string
+    assert "--- Node Y\n" "- The MSE is " in summary_string
+    assert "- The NMSE is " in summary_string
+    assert "- The R2 coefficient is " in summary_string
+    assert "- The normalized CRPS is " in summary_string
     assert "The estimated CRPS indicates a very good model performance." in summary_string
     assert "The mechanism is better or equally good than all " in summary_string
 
@@ -264,14 +295,36 @@ def test_given_categorical_data_only_when_evaluate_model_returns_expected_inform
             ],
         ),
     )
+
     assert summary.overall_kl_divergence == approx(0, abs=0.05)
-    assert summary.model_performances["X0"][1] == approx(0, abs=0.15)
-    assert summary.model_performances["X1"][1] == approx(0, abs=0.15)
-    assert summary.model_performances["Y"][1] == approx(0.02, abs=0.05)  # CRPS
-    assert len(summary.pnl_assumptions) == 0
-    summary.plot_falsification_histogram = False
+    assert summary.mechanism_performances["X0"].kl_divergence == approx(0, abs=0.2)
+    assert summary.mechanism_performances["X0"].crps == None
+    assert summary.mechanism_performances["X0"].nmse == None
+    assert summary.mechanism_performances["X0"].r2 == None
+    assert summary.mechanism_performances["X0"].f1 == None
+    assert summary.mechanism_performances["X0"].total_number_baselines == 0
+
+    assert summary.mechanism_performances["X1"].kl_divergence == approx(0, abs=0.2)
+    assert summary.mechanism_performances["X1"].crps == None
+    assert summary.mechanism_performances["X1"].nmse == None
+    assert summary.mechanism_performances["X1"].r2 == None
+    assert summary.mechanism_performances["X1"].f1 == None
+    assert summary.mechanism_performances["X1"].total_number_baselines == 0
+
+    assert summary.mechanism_performances["Y"].kl_divergence == None
+    assert summary.mechanism_performances["Y"].crps == approx(0.02, abs=0.02)
+    assert summary.mechanism_performances["Y"].nmse == None
+    assert summary.mechanism_performances["Y"].r2 == None
+    assert summary.mechanism_performances["Y"].f1 == approx(0.97, abs=0.05)
+    assert 0 < summary.mechanism_performances["Y"].total_number_baselines <= 2
+    assert summary.mechanism_performances["Y"].count_better_performance == 0
+
+    assert "X0" not in summary.pnl_assumptions
+    assert "X1" not in summary.pnl_assumptions
+    assert "Y" not in summary.pnl_assumptions
+    summary.plot_falsification_histogram = False
     summary_string = str(summary)
 
     assert (
@@ -279,23 +332,19 @@ def test_given_categorical_data_only_when_evaluate_model_returns_expected_inform
 ==== Evaluation of Causal Mechanisms ====
 Root nodes are evaluated based on the KL divergence between the generated and the observed distribution.
-Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances.
-
---- Node X0: The KL divergence between generated and observed distribution is """
+Non-root nodes are mainly evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. In addition, the mean squared error (MSE), the normalized MSE (NMSE), the R2 coefficient and the F1 score (for categorical nodes) are reported."""
         in summary_string
     )
-    assert "--- Node Y: The normalized CRPS of this node is " in summary_string
+    assert "--- Node X0\n" "- The KL divergence between generated and observed distribution is " in summary_string
+    assert "--- Node X1\n" "- The KL divergence between generated and observed distribution is " in summary_string
+    assert "--- Node Y\n" "- The F1 score is " in summary_string
+    assert "- The normalized CRPS is " in summary_string
     assert "The estimated CRPS indicates a very good model performance." in summary_string
+    assert "The mechanism is better or equally good than all " in summary_string
 
     assert "==== Evaluation of Invertible Functional Causal Model Assumption ====" in summary_string
     assert "The causal model has no invertible causal models." in summary_string
 
-    assert (
-        (
-            """==== Evaluation of Generated Distribution ====
-The overall average KL divergence between the generated and observed distribution is"""
-        )
-        in summary_string
-    )
+    assert "==== Evaluation of Generated Distribution ====" in summary_string
     assert (
         "The estimated KL divergence indicates an overall very good representation of the data distribution"
         in summary_string