diff --git a/docs/source/example_notebooks/gcm_basic_example.ipynb b/docs/source/example_notebooks/gcm_basic_example.ipynb index 46d7977309..20e0248bfe 100644 --- a/docs/source/example_notebooks/gcm_basic_example.ipynb +++ b/docs/source/example_notebooks/gcm_basic_example.ipynb @@ -51,21 +51,34 @@ }, { "cell_type": "markdown", - "source": [ - "At this point we would normally load our dataset. For this introduction, we generate\n", - "some synthetic data instead. The API takes data in form of Pandas DataFrames:" - ], + "id": "85ea234d", "metadata": { "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, "pycharm": { "name": "#%% md\n" } }, - "id": "7ec5357c05109df9" + "source": [ + "At this point we would normally load our dataset. For this introduction, we generate\n", + "some synthetic data instead. The API takes data in form of Pandas DataFrames:" + ] }, { "cell_type": "code", "execution_count": null, + "id": "71c06991", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, "outputs": [], "source": [ "import numpy as np, pandas as pd\n", @@ -75,17 +88,20 @@ "Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)\n", "data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))\n", "data.head()" - ], + ] + }, + { + "cell_type": "markdown", + "id": "b24c2de4", "metadata": { "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, "pycharm": { - "name": "#%%\n" + "name": "#%% md\n" } }, - "id": "fe1b3eff33c5faf8" - }, - { - "cell_type": "markdown", "source": [ "Note how the columns X, Y, Z correspond to our nodes X, Y, Z in the graph constructed above. We can also see how the\n", "values of X influence the values of Y and how the values of Y influence the values of Z in that data set.\n", @@ -94,29 +110,43 @@ "models. Here, these mechanism can either be assigned manually if, for instance, prior knowledge about certain causal\n", "relationships are known or they can be assigned automatically using the `auto` module. 
For the latter,\n", "we simply call:" - ], + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2c342c9", "metadata": { "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, "pycharm": { - "name": "#%% md\n" + "name": "#%%\n" } }, - "id": "54e59a824d7fc2d9" + "outputs": [], + "source": [ + "auto_assignment_summary = gcm.auto.assign_causal_mechanisms(causal_model, data)" + ] + }, + { + "cell_type": "markdown", + "id": "6f5d2959-342a-4a2f-af49-3ba0e698f607", + "metadata": {}, + "source": [ + "Optionally, we can get more insights from the auto assignment process:" + ] }, { "cell_type": "code", "execution_count": null, + "id": "4f1493d8-9a0b-4ed5-97be-7b5f39cbd7d4", + "metadata": {}, "outputs": [], "source": [ - "gcm.auto.assign_causal_mechanisms(causal_model, data)" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - }, - "id": "ced986f744d8e333" + "print(auto_assignment_summary)" + ] }, { "cell_type": "markdown", @@ -130,19 +160,30 @@ { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "causal_model.set_causal_mechanism('X', gcm.EmpiricalDistribution())\n", - "causal_model.set_causal_mechanism('Y', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))\n", - "causal_model.set_causal_mechanism('Z', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))" - ], + "id": "5a50a60a", "metadata": { "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, "pycharm": { "name": "#%%\n" } }, - "id": "ae90eccaf892bb41" + "outputs": [], + "source": [ + "causal_model.set_causal_mechanism('X', gcm.EmpiricalDistribution())\n", + "causal_model.set_causal_mechanism('Y', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))\n", + "causal_model.set_causal_mechanism('Z', gcm.AdditiveNoiseModel(gcm.ml.create_linear_regressor()))" + ] + }, + { + "cell_type": "markdown", + "id": "e0d1cbd1-2544-4b18-a2c9-ddd59cd69f94", + "metadata": {}, + "source": [ + "Here, we set node X to follow the empirical distribution we observed (nonparametric) and nodes Y and Z to follow an additive noise model where we explicitly set a linear relationship." + ] }, { "cell_type": "markdown", @@ -176,6 +217,46 @@ "Fitting means, we learn the generative models of the variables in the SCM according to the data." ] }, + { + "cell_type": "markdown", + "id": "766840ab-a353-4061-ae28-f9a9c451e490", + "metadata": {}, + "source": [ + "Once fitted, we can also obtain more insights into the model performances:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "671cd8d8-75b0-4019-b503-8aa83ad91615", + "metadata": {}, + "outputs": [], + "source": [ + "print(gcm.evaluate_causal_model(causal_model, data))" + ] + }, + { + "cell_type": "markdown", + "id": "176d918a-47ed-4440-a925-51f6c19281da", + "metadata": {}, + "source": [ + "This summary tells us a few things:\n", + "- Our models are a good fit.\n", + "- The additive noise model assumption was not rejected.\n", + "- The generated distribution is very similar to the observed one.\n", + "- The causal graph structure was not rejected." + ] + }, + { + "cell_type": "markdown", + "id": "67cfaf23-1b90-4124-84cb-39ea67401a22", + "metadata": {}, + "source": [ + "
\n", + "Note, this evaluation take some significant time depending on the model complexities, graph size and amount of data. For a speed-up, consider changing the evaluation parameters.\n", + "
" + ] + }, { "cell_type": "markdown", "id": "fc324e13", @@ -205,28 +286,29 @@ }, { "cell_type": "markdown", - "source": [ - "This intervention says: \"I'll ignore any causal effects of X on Y, and set every value of Y\n", - "to 2.34.\" So the distribution of X will remain unchanged, whereas values of Y will be at a fixed\n", - "value and Z will respond according to its causal model." - ], + "id": "9976c633", "metadata": { "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, "pycharm": { "name": "#%% md\n" } }, - "id": "68a94da68fba303e" + "source": [ + "This intervention says: \"I'll ignore any causal effects of X on Y, and set every value of Y to 2.34.\" So the distribution of X will remain unchanged, whereas values of Y will be at a fixed value and Z will respond according to its causal model." + ] }, { "cell_type": "markdown", + "id": "d65e3319-58ea-4284-8810-c88b10d08448", + "metadata": {}, "source": [ - "DoWhy offers a wide range of causal questions that can be answered with GCMs. See the user guide or other notebooks for more examples." - ], - "metadata": { - "collapsed": false - }, - "id": "1e7aed4828c843a9" + "
\n", + "DoWhy offers a wide range of causal questions that can be answered with GCMs. See the user guide or other notebooks for more examples.\n", + "
" + ] } ], "metadata": { @@ -245,7 +327,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.12" + "version": "3.9.16" } }, "nbformat": 4, diff --git a/docs/source/example_notebooks/gcm_counterfactual_medical_dry_eyes.ipynb b/docs/source/example_notebooks/gcm_counterfactual_medical_dry_eyes.ipynb index 7fcd5efd45..6ba87320d7 100644 --- a/docs/source/example_notebooks/gcm_counterfactual_medical_dry_eyes.ipynb +++ b/docs/source/example_notebooks/gcm_counterfactual_medical_dry_eyes.ipynb @@ -88,6 +88,24 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "Optionally, we can now also evaluate the fitted causal model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(gcm.evaluate_causal_model(causal_model, medical_data))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This confirms the accuracy of our causal model.\n", + "\n", "Now returning to our original problem, let's load Alice's data, who happens to have a rare allergy (Condition = 1)." ] }, diff --git a/docs/source/example_notebooks/gcm_draw_samples.ipynb b/docs/source/example_notebooks/gcm_draw_samples.ipynb index d840ccce4b..6ee23ff563 100644 --- a/docs/source/example_notebooks/gcm_draw_samples.ipynb +++ b/docs/source/example_notebooks/gcm_draw_samples.ipynb @@ -104,9 +104,7 @@ } }, "source": [ - "If our modeling assumptions are correct, the generated data should now resemble the observed data distribution, i.e. the\n", - "generated samples correspond to the joint distribution we defined for our example data at the beginning. A quick\n", - "way to make sure of this is to estimate the KL-divergence between observed and generated distribution:" + "If our modeling assumptions are correct, the generated data should now resemble the observed data distribution, i.e. the generated samples correspond to the joint distribution we defined for our example data at the beginning. One way to make sure of this is to estimate the KL-divergence between observed and generated distribution. For this, we can make use of the evaluation module:" ] }, { @@ -124,25 +122,25 @@ }, "outputs": [], "source": [ - "gcm.divergence.auto_estimate_kl_divergence(data.to_numpy(), generated_data.to_numpy())" + "print(gcm.evaluate_causal_model(causal_model, data, evaluate_causal_mechanisms=False, evaluate_invertibility_assumptions=False))" ] }, { "cell_type": "markdown", - "id": "0ccbc2ad", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "9d12f880-685b-4110-b83d-3d9837b3223e", + "metadata": {}, "source": [ - "Here, we expect the divergence to be (very) small.\n", - "\n", - "**Note**: We **cannot** validate the correctness of a causal graph this way,\n", - "since any graph from a Markov equivalence class would be sufficient to generate data that is consistent with the observed one,\n", - "but only one particular graph would generate the correct interventional and counterfactual distributions. This is, seeing the example above,\n", - "X→Y→Z and X←Y←Z would generate the same observational distribution (since they encode the same conditionals), but only X→Y→Z would generate the\n", - "correct interventional distribution (e.g. when intervening on Y)." + "This confirms that the generated distribution is close to the observed one. " + ] + }, + { + "cell_type": "markdown", + "id": "f430827d-fc75-4c23-8d2d-3d98605dd927", + "metadata": {}, + "source": [ + "
\n", + "While the evaluation provides us insights toward the causal graph structure as well, we cannot confirm the graph structure, only reject it if we find inconsistencies between the dependencies of the observed structure and what the graph represents. In our case, we do not reject the DAG, but there are other equivalent DAGs that would not be rejected as well. To see this, consider the example above - X→Y→Z and X←Y←Z would generate the same observational distribution (since they encode the same conditionals), but only X→Y→Z would generate the correct interventional distribution (e.g., when intervening on Y).\n", + "
" ] } ], @@ -162,9 +160,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.12" + "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/docs/source/example_notebooks/gcm_rca_microservice_architecture.ipynb b/docs/source/example_notebooks/gcm_rca_microservice_architecture.ipynb index 2aa53d0f09..59df41010f 100644 --- a/docs/source/example_notebooks/gcm_rca_microservice_architecture.ipynb +++ b/docs/source/example_notebooks/gcm_rca_microservice_architecture.ipynb @@ -79,7 +79,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Setting up the causal graph\n", + "## Setting up the causal model\n", "\n", "If we look at the `Website` node, it becomes apparent that the latency we experience there depends on the latencies of\n", "all downstream nodes. In particular, if one of the downstream nodes takes a long time, `Website` will also take a\n", @@ -165,6 +165,39 @@ "" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we contiue with the first scenario, let's first evaluate our causal model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gcm.fit(causal_model, normal_data)\n", + "print(gcm.evaluate_causal_model(causal_model, normal_data))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This confirms the goodness of our causal model. However, we also see that for two nodes ('Product Service' and 'Caching Service'), the additive noise model assumption is violated. This also aligns with the data generation process, where these two nodes follow non-additive noise models. As we see in the following, most of the algorithms are still fairly robust against such violations or poor performance of the causal mechanism." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "For more detailed insights, set compare_mechanism_baselines to True. However, this will take significantly longer.\n", + "
" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -181,10 +214,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -203,10 +233,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -233,10 +260,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -266,10 +290,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -564,7 +585,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.12" + "version": "3.9.16" } }, "nbformat": 4, diff --git a/docs/source/example_notebooks/gcm_supply_chain_dist_change.ipynb b/docs/source/example_notebooks/gcm_supply_chain_dist_change.ipynb index bb8a41c34a..af9f454e79 100644 --- a/docs/source/example_notebooks/gcm_supply_chain_dist_change.ipynb +++ b/docs/source/example_notebooks/gcm_supply_chain_dist_change.ipynb @@ -70,6 +70,7 @@ { "cell_type": "markdown", "metadata": { + "collapsed": false, "jupyter": { "outputs_hidden": false } @@ -152,6 +153,7 @@ { "cell_type": "markdown", "metadata": { + "collapsed": false, "jupyter": { "outputs_hidden": false } @@ -188,9 +190,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Attributing change\n", - "\n", - "We can now attribute the week-over-week change in the average value of *received* quantity using *Causality*. " + "Now, we can setup the causal model:" ] }, { @@ -211,8 +211,68 @@ "causal_model = gcm.StructuralCausalModel(causal_graph)\n", "\n", "# Automatically assign appropriate causal models to each node in graph\n", - "gcm.auto.assign_causal_mechanisms(causal_model, data_week1)\n", - "\n", + "auto_assignment_summary = gcm.auto.assign_causal_mechanisms(causal_model, data_week1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we attributing the changes to the nodes, let's first take a look at the result of the auto assignment:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(auto_assignment_summary)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It seems most of the relationship can be well captured using a linear model. Let's further evaluate the assumed graph structure." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gcm.falsify.falsify_graph(causal_graph, data_week1, n_permutations=20, plot_histogram=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since we do not reject the DAG, we consider our causal graph structure to be confirmed." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Attributing change" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now attribute the week-over-week change in the average value of *received* quantity:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# call the API for attributing change in the average value of `received`\n", "contributions = gcm.distribution_change(causal_model,\n", " data_week1, \n", @@ -327,7 +387,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.12" + "version": "3.9.16" } }, "nbformat": 4, diff --git a/docs/source/user_guide/intro.rst b/docs/source/user_guide/intro.rst index f62109b7d6..c6c3907758 100644 --- a/docs/source/user_guide/intro.rst +++ b/docs/source/user_guide/intro.rst @@ -30,10 +30,12 @@ Testing validity of a causal analysis ------------------------------------- Since causal tasks concern an interventional data distribution that is often not observed, we need special ways to evaluate the validity of a causal estimate. Methods like cross-validation from predictive machine learning do not work, unless we have access to samples from the interventional distribution. Therefore, for each causal task, a important part of the analysis is to test whether the obtained answer is valid. In DoWhy, we call this process *refutation*, which involves refuting or challenging the assumptions made by a causal analysis. Refutations are performed at two stages: after modeling the causal graph, and after completing the analysis for a task. -In the first stage, graph refutations test whether the assumptions encoded in a given causal graph are valid. This is an important set of refutations since all downstream analysis depends on the graph. These refutations are typically task-agnostic and we recommend running them to improve the quality of the assumed graph. DoWhy's functionality for refuting a causal graph is described in :doc:`modeling_causal_relations/refuting_causal_graph/index`. The second kind of refutations, estimate refutations, are conducted after the task analysis returns a causal estimate. These refutations test whether the analysis follows best practices, provides the correct answer under special test data, and how robust the final estimate is to violation of assumptions. Estimate refutations can help improve the robustness of an analysis or help choose between multiple candidate models in the analysis. We discuss estimate refutations in a separate chapter, :doc:`refuting_causal_estimates/index`. +In the first stage, graph refutations test whether the assumptions encoded in a given causal graph are valid. This is an important set of refutations since all downstream analysis depends on the graph. These refutations are typically task-agnostic and we recommend running them to improve the quality of the assumed graph. DoWhy's functionality for refuting a causal graph is described in :doc:`modeling_causal_relations/refuting_causal_graph/index`. The second kind of refutations, estimate refutations, are conducted after the task analysis returns a causal estimate. These refutations test whether the analysis follows best practices, provides the correct answer under special test data, and how robust the final estimate is to violation of assumptions. Estimate refutations can help improve the robustness of an analysis or help choose between multiple candidate models in the analysis. 
We discuss estimate refutations in a separate chapter, :doc:`refuting_causal_estimates/index`.
 
 For an alternative approach of validating a given causal graph, see `Falsification of User-Given Directed Acyclic Graphs <../example_notebooks/gcm_falsify_dag.html>`_.
 
+For evaluating a graphical causal model, see :doc:`modeling_gcm/model_evaluation`.
+
 Who this user guide is for
 --------------------------
 If you are new to causal inference, this user guide helps you understand the difference causal tasks and provides examples on how to implement them using DoWhy.
diff --git a/docs/source/user_guide/modeling_gcm/graph_evaluation.png b/docs/source/user_guide/modeling_gcm/graph_evaluation.png
new file mode 100644
index 0000000000..da871892d0
Binary files /dev/null and b/docs/source/user_guide/modeling_gcm/graph_evaluation.png differ
diff --git a/docs/source/user_guide/modeling_gcm/index.rst b/docs/source/user_guide/modeling_gcm/index.rst
index a4d9a9a4e6..293660ee6c 100644
--- a/docs/source/user_guide/modeling_gcm/index.rst
+++ b/docs/source/user_guide/modeling_gcm/index.rst
@@ -96,6 +96,10 @@ Fitting means, we learn the generative models of the variables in the SCM accord
 
 The causal model is now ready to be used for :doc:`../causal_tasks/index`.
 
+Evaluating a fitted SCM
+-----------------------
+
+For evaluating the fitted model, see :doc:`model_evaluation`.
 
 Other related GCM topics
 ------------------------
@@ -109,5 +113,6 @@
 
    graphical_causal_model_types
    draw_samples
+   model_evaluation
    customizing_model_assignment
    estimating_confidence_intervals
diff --git a/docs/source/user_guide/modeling_gcm/model_evaluation.rst b/docs/source/user_guide/modeling_gcm/model_evaluation.rst
new file mode 100644
index 0000000000..80101a9a47
--- /dev/null
+++ b/docs/source/user_guide/modeling_gcm/model_evaluation.rst
@@ -0,0 +1,183 @@
+Evaluate a GCM
+==============
+
+Modeling a graphical causal model (GCM) requires various assumptions and choices of models, all of which can influence
+the performance and accuracy of the model. Some examples of the required assumptions include:
+
+**Graph Structure:** Correctly specifying the causal directions between variables is crucial. For example, it is logical to
+say that rain causes the street to be wet, whereas asserting that a wet street causes rain does not make sense. If
+these relationships are modeled incorrectly, the resultant causal statements will be misleading. However, it is
+important to note that while this example is quite straightforward, models typically exhibit some level of robustness
+to misspecifications. In particular, the impact of misspecification tends to be less severe in larger graphs.
+Additionally, the severity of misspecifications can vary; for instance, defining an incorrect causal direction is
+generally more problematic than including too many (upstream) nodes as potential parents of a node.
+
+**Causal Mechanism Assumption:** To model the causal data generation process, we represent each node with a causal
+mechanism of the form :math:`X_i = f_i(PA_i, N_i)`, where :math:`N_i` denotes unobserved noise, and :math:`PA_i`
+represents the causal parents of :math:`X_i`. In this context, we require an additional assumption regarding the
+form of the function :math:`f_i`. For continuous variables, for instance, it is common to model :math:`f_i` using an
+additive noise model of the form :math:`X_i = f_i(PA_i) + N_i`. 
However, this representation may not be accurate if the +true relationship is different (e.g., multiplicative). Thus, the type of causal mechanism is another factor that can +influence the results. Generally, however, the additive noise model assumption in the continuous case tends to be +relatively robust to violations in practice. + +**Model Selection:** The previous two points focused solely on accurately representing the causal relationships between +variables. Now, the process of model selection adds an additional layer. Sticking to the additive noise model +:math:`X_i = f_i(PA_i) + N_i` as an example, the challenge in model selection lies in determining the optimal model +for :math:`f_i`. Ideally, this would be the model that minimizes the mean squared error. + +Given the multitude of factors that impact the performance of a GCM, each with their own metrics and challenges, +``dowhy.gcm`` has a module that aims at evaluating a fitted GCM and providing an overview of different evaluation +metrics. Furthermore, if the auto assignment is used, we can obtain an overview of the evaluated models and +performances in the selection process. + +Summary of auto assignment +-------------------------- + +If we use the auto assignment function, we obtain additional insights into the model selection process. To illustrate +this, consider the chain structure example X→Y→Z: + +>>> import numpy as np, pandas as pd +>>> import networkx as nx +>>> import dowhy.gcm as gcm +>>> +>>> X = np.random.normal(loc=0, scale=1, size=1000) +>>> Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000) +>>> Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000) +>>> data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z)) +>>> +>>> causal_model = gcm.StructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')])) +>>> summary_auto_assignment = gcm.auto.assign_causal_mechanisms(causal_model, data) +>>> print(summary_auto_assignment) + +.. code-block:: + + Analyzed 3 nodes. + --- Node: X + Node X is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + + --- Node: Y + Node Y is a non-root node. Assigning 'AdditiveNoiseModel using LinearRegression' to the node. + This represents the causal relationship as Y := f(X) + N. + For the model selection, the following models were evaluated on the mean squared error (MSE) metric: + LinearRegression: 1.0023387259040388 + Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)), + ('linearregression', LinearRegression)]): 1.0099017476403862 + HistGradientBoostingRegressor: 1.1091403766880177 + Based on the type of causal mechanism, the model with the lowest metric value represents the best choice. + + --- Node: Z + Node Z is a non-root node. Assigning 'AdditiveNoiseModel using LinearRegression' to the node. + This represents the causal relationship as Z := f(Y) + N. + For the model selection, the following models were evaluated on the mean squared error (MSE) metric: + LinearRegression: 0.9451918596711175 + Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(include_bias=False)), + ('linearregression', LinearRegression)]): 0.9488259577453813 + HistGradientBoostingRegressor: 1.682146254853607 + Based on the type of causal mechanism, the model with the lowest metric value represents the best choice. + +In this scenario, an empirical distribution is assigned to the root node X, while additive noise models are applied +to nodes Y and Z. 
In both of these cases, a linear regression model demonstrated the best performance in terms
+of minimizing the mean squared error. A list of evaluated models and their performance is also available. Since we used
+the default parameters for the auto assignment, only a small model zoo is evaluated. However, we can also adjust the
+assignment quality to extend it to more models.
+
+After assigning causal mechanisms to each node, the subsequent step involves fitting these mechanisms to the data:
+
+>>> gcm.fit(causal_model, data)
+
+Evaluating a fitted GCM
+-----------------------
+
+The causal model has been fitted and can be used for different causal questions. However, we might be interested in
+obtaining some insights into the model performance first, i.e., we might wonder:
+
+- How well do my causal mechanisms perform?
+- Is the additive noise model assumption even valid for my data?
+- Does the GCM capture the joint distribution of the observed data?
+- Is my causal graph structure compatible with the data?
+
+For this, we can use the causal model evaluation function, which provides us with some insights into the overall model
+performance and whether our assumptions hold:
+
+>>> summary_evaluation = gcm.evaluate_causal_model(causal_model, data, compare_mechanism_baselines=True)
+>>> print(summary_evaluation)
+
+.. code-block::
+
+    Evaluated the performance of the causal mechanisms and the invertibility assumption of the causal mechanisms and the overall average KL divergence between generated and observed distribution and graph structure. The results are as follows:
+
+    ==== Evaluation of Causal Mechanisms ====
+    Root nodes are evaluated based on the KL divergence between the generated and the observed distribution.
+    Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances.
+
+    --- Node X: The KL divergence between generated and observed distribution is 0.020548269898818708.
+    The estimated KL divergence indicates an overall very good representation of the data distribution.
+
+    --- Node Y: The normalized CRPS of this node is 0.26169914525652427.
+    The estimated CRPS indicates a good model performance.
+
+    --- Node Z: The normalized CRPS of this node is 0.08497732548860475.
+    The estimated CRPS indicates a very good model performance.
+
+    ==== Evaluation of Invertible Functional Causal Model Assumption ====
+
+    --- The model assumption for node Y is not rejected with a p-value of 0.9261751353508025 (after potential adjustment) and a significance level of 0.05.
+    This implies that the model assumption might be valid.
+
+    --- The model assumption for node Z is not rejected with a p-value of 1.0 (after potential adjustment) and a significance level of 0.05.
+    This implies that the model assumption might be valid.
+
+    Note that these results are based on statistical independence tests, and the fact that the assumption was not rejected does not necessarily imply that it is correct. There is just no evidence against it. 
+ + ==== Evaluation of Generated Distribution ==== + The overall average KL divergence between the generated and observed distribution is 0.04045436327952057 + The estimated KL divergence indicates an overall very good representation of the data distribution. + + ==== Evaluation of the Causal Graph Structure ==== + +-------------------------------------------------------------------------------------------------------+ + | Falsificaton Summary | + +-------------------------------------------------------------------------------------------------------+ + | The given DAG is not informative because 2 / 6 of the permutations lie in the Markov | + | equivalence class of the given DAG (p-value: 0.33). | + | The given DAG violates 0/1 LMCs and is better than 66.7% of the permuted DAGs (p-value: 0.33). | + | Based on the provided significance level (0.05) and because the DAG is not informative, | + | we do not reject the DAG. | + +-------------------------------------------------------------------------------------------------------+ + + ==== NOTE ==== + Always double check the made model assumptions with respect to the graph structure and choice of causal mechanisms. + All these evaluations give some insight into the goodness of the causal model, but should not be overinterpreted, since some causal relationships can be intrinsically hard to model. Furthermore, many algorithms are fairly robust against misspecifications or poor performances of causal mechanisms. + +.. image:: graph_evaluation.png + :alt: Causal Graph Falsification + + +As we see, we get a detailed overview of different evaluations: + +**Evaluation of Causal Mechanisms:** Evaluation of the causal mechanisms with respect to their model performance. +The performance of non-root nodes is measured using the Continuous Ranked Probability Score (CRPS), and the performance +of root nodes is measured using the KL divergence between the generated and observed data distributions. + +Optionally, we can set the `compare_mechanism_baselines` parameter to `True` in order +to compare the mechanisms with some baseline models. This gives us better insights into how the mechanisms perform in +comparison with other models. Note, however, that this can take significant time for larger graphs. + +**Evaluation of Invertible Functional Causal Model Assumption:** If the causal mechanism is an invertible functional +causal model, we can validate if the assumption holds true. Note that an invertible function here means with respect to +the noise, i.e., an additive noise model :math:`X_i = f_i(PA_i) + N_i` and, more generally, post non-linear models +:math:`X_i = g_i(f_i(PA_i) + N_i)` are examples for such types of mechanisms. In this case, the estimated noise based on +the observation should be independent of the inputs. + +**Evaluation of Generated Distribution:** Since the GCM is able to generate new samples from the learned distributions, +we can evaluate whether the generated (joint) distribution coincides with the observed one. Here, the difference should +be as small as possible. + +**Evaluation of the Causal Graph Structure:** The graph structure should represent the (conditional) independencies +in the observed data (assuming faithfulness). This can be exploited to obtain some insights on whether the given +graph violates the (in)dependence structures based on the data. For this, an algorithm is used that checks whether the +graph can be rejected. 
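
The individual checks can also be tuned or switched off through the ``config`` parameter (a sketch; the class and its
parameter names are defined in ``dowhy/gcm/model_evaluation.py``, while the specific values below are illustrative):

>>> from dowhy.gcm.model_evaluation import EvaluateCausalModelConfig
>>> quick_config = EvaluateCausalModelConfig(mechanism_evaluation_kfolds=3, max_num_permutations_falsify=20)
>>> print(gcm.evaluate_causal_model(causal_model, data, config=quick_config))
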
+
+Note that all these evaluation methods only provide some insights into the provided GCM, but cannot fully confirm
+the correctness of a learned model. For more details about the metrics and evaluation methods, please see the
+corresponding docstring of the method.
\ No newline at end of file
diff --git a/dowhy/gcm/__init__.py b/dowhy/gcm/__init__.py
index a8d75243fb..735d5d40e8 100644
--- a/dowhy/gcm/__init__.py
+++ b/dowhy/gcm/__init__.py
@@ -27,6 +27,7 @@
 )
 from .influence import arrow_strength, intrinsic_causal_influence
 from .ml import ClassificationModel, PredictionModel
+from .model_evaluation import evaluate_causal_model
 from .stochastic_models import BayesianGaussianMixtureDistribution, EmpiricalDistribution, ScipyDistribution
 from .unit_change import unit_change
 from .validation import RejectionResult, refute_causal_structure, refute_invertible_model
diff --git a/dowhy/gcm/auto.py b/dowhy/gcm/auto.py
index 72f81adccc..0d8021eceb 100644
--- a/dowhy/gcm/auto.py
+++ b/dowhy/gcm/auto.py
@@ -72,7 +72,7 @@
 ]
 _LIST_OF_POTENTIAL_REGRESSORS_BETTER = _LIST_OF_POTENTIAL_REGRESSORS_GOOD + [
     create_ridge_regressor,
-    partial(create_lasso_regressor, max_iter=5000),
+    partial(create_lasso_regressor, max_iter=10000),
     create_random_forest_regressor,
     create_support_vector_regressor,
     create_extra_trees_regressor,
@@ -111,7 +111,7 @@ def __str__(self):
         summary_strings.append("Analyzed %d nodes." % len(list(self._nodes)))
 
         for node in self._nodes:
-            summary_strings.append("--- Node: %s" % node)
+            summary_strings.append("\n--- Node: %s" % node)
             summary_strings.extend(self._nodes[node]["messages"])
 
             if len(self._nodes[node]["model_performances"]) > 0:
diff --git a/dowhy/gcm/causal_models.py b/dowhy/gcm/causal_models.py
index 2813c6e916..2e0d366544 100644
--- a/dowhy/gcm/causal_models.py
+++ b/dowhy/gcm/causal_models.py
@@ -141,8 +141,8 @@ def validate_local_structure(causal_graph: DirectedGraph, node: Any) -> None:
         PARENTS_DURING_FIT
     ] != get_ordered_predecessors(causal_graph, node):
         raise RuntimeError(
-            "The causal mechanism of node %s is not fitted to the graphical structure! Fit all"
-            "causal models in the graph first. If the mechanism is already fitted based on the causal"
+            "The causal mechanism of node %s is not fitted to the graphical structure! Fit all "
+            "causal models in the graph first. If the mechanism is already fitted based on the causal "
             "parents, consider to update the persisted parents for that node manually." % node
         )
diff --git a/dowhy/gcm/divergence.py b/dowhy/gcm/divergence.py
index 3902b339d1..13833d5bf9 100644
--- a/dowhy/gcm/divergence.py
+++ b/dowhy/gcm/divergence.py
@@ -21,7 +21,11 @@ def auto_estimate_kl_divergence(X: np.ndarray, Y: np.ndarray) -> float:
     if X.ndim == 2 and X.shape[1] > 1:
         return estimate_kl_divergence_continuous_clf(X, Y)
     else:
-        return estimate_kl_divergence_continuous_knn(X, Y)
+        try:
+            return estimate_kl_divergence_continuous_knn(X, Y)
+        except _KNNTooFewSamples:
+            # If there are too many common elements, this error can happen.
+            return estimate_kl_divergence_continuous_clf(X, Y)
 
 
 def estimate_kl_divergence_continuous_knn(
@@ -59,6 +63,11 @@
     # a division by zero.
     if remove_common_elements:
         X = setdiff2d(X, Y, assume_unique=True)
+        if X.shape[0] < k + 1:
+            raise _KNNTooFewSamples(
+                "After removing common elements, there are too few samples left! "
+                "Got %d samples with k=%d." 
% (X.shape[0], k) + ) n, m = X.shape[0], Y.shape[0] if n == 0: @@ -197,3 +206,9 @@ def is_probability_matrix(X: np.ndarray) -> bool: return False else: return np.all(np.isclose(np.sum(abs(X.astype(np.float64)), axis=1), 1)) + + +class _KNNTooFewSamples(Exception): + def __init__(self, message: str): + self.message = message + super().__init__(self.message) diff --git a/dowhy/gcm/falsify.py b/dowhy/gcm/falsify.py index 08830e8e6a..3a794c348a 100644 --- a/dowhy/gcm/falsify.py +++ b/dowhy/gcm/falsify.py @@ -19,7 +19,6 @@ from dowhy.gcm.independence_test import kernel_based from dowhy.gcm.util import plot from dowhy.gcm.util.general import set_random_seed -from dowhy.gcm.validation import _get_non_descendants from dowhy.graph import DirectedGraph, get_ordered_predecessors VIOLATION_COLOR = "red" @@ -970,3 +969,10 @@ def _to_frozenset(x: Union[Set, List, str]): if isinstance(x, str): return frozenset({x}) return frozenset(x) + + +def _get_non_descendants(causal_graph: DirectedGraph, node: Any, exclude_parents: bool = False) -> List[Any]: + nodes_to_exclude = nx.descendants(causal_graph, node).union({node}) + if exclude_parents: + nodes_to_exclude = nodes_to_exclude.union(causal_graph.predecessors(node)) + return list(set(causal_graph.nodes).difference(nodes_to_exclude)) diff --git a/dowhy/gcm/ml/regression.py b/dowhy/gcm/ml/regression.py index 7b4a6e07eb..e4ba51de06 100644 --- a/dowhy/gcm/ml/regression.py +++ b/dowhy/gcm/ml/regression.py @@ -74,7 +74,7 @@ def clone(self): def create_linear_regressor_with_given_parameters( - coefficients: np.ndarray, intercept: float = 0, **kwargs + coefficients: np.ndarray, intercept: float = 0 ) -> LinearRegressionWithFixedParameter: return LinearRegressionWithFixedParameter(np.array(coefficients), intercept) diff --git a/dowhy/gcm/model_evaluation.py b/dowhy/gcm/model_evaluation.py new file mode 100644 index 0000000000..857251cff4 --- /dev/null +++ b/dowhy/gcm/model_evaluation.py @@ -0,0 +1,795 @@ +from dataclasses import dataclass +from functools import partial +from typing import Any, Callable, Dict, List, Optional, Tuple + +import networkx as nx +import numpy as np +import pandas as pd +from joblib import Parallel, delayed +from numpy.matlib import repmat +from scipy.stats import mode +from sklearn.metrics import f1_score, mean_squared_error +from sklearn.model_selection import KFold +from statsmodels.stats.multitest import multipletests +from tqdm import tqdm + +from dowhy.gcm import ( + ClassifierFCM, + PostNonlinearModel, + PredictionModel, + ProbabilisticCausalModel, + config, + draw_samples, + kernel_based, +) +from dowhy.gcm.causal_mechanisms import AdditiveNoiseModel, ConditionalStochasticModel, InvertibleFunctionalCausalModel +from dowhy.gcm.divergence import auto_estimate_kl_divergence +from dowhy.gcm.falsify import EvaluationResult, falsify_graph, plot_evaluation_results +from dowhy.gcm.ml import ( + SklearnRegressionModel, + create_hist_gradient_boost_classifier, + create_hist_gradient_boost_regressor, + create_lasso_regressor, + create_linear_regressor, + create_logistic_regression_classifier, + create_random_forest_classifier, + create_random_forest_regressor, + create_ridge_regressor, +) +from dowhy.gcm.ml.classification import ( + create_ada_boost_classifier, + create_gaussian_nb_classifier, + create_knn_classifier, + create_polynom_logistic_regression_classifier, +) +from dowhy.gcm.ml.regression import create_ada_boost_regressor, create_extra_trees_regressor, create_polynom_regressor +from dowhy.gcm.stats import quantile_based_fwer +from 
dowhy.gcm.util.general import is_categorical, set_random_seed, shape_into_2d +from dowhy.graph import get_ordered_predecessors, is_root_node + + +class EvaluateCausalModelConfig: + """Config for the causal model evaluation.""" + + def __init__( + self, + mechanism_evaluation_kfolds: int = 5, + baseline_models_regression: Optional[List[Callable[[], PredictionModel]]] = None, + baseline_models_classification: Optional[List[Callable[[], PredictionModel]]] = None, + independence_test_invertible: Callable[[np.ndarray, np.ndarray], float] = partial( + kernel_based, use_bootstrap=False + ), + significance_level_invertible: float = 0.05, + fdr_control_method_invertible: Optional[str] = "bonferroni", + bootstrap_runs_invertible: int = 5, + max_num_permutations_falsify: int = 50, + independence_test_falsify: Callable[[np.ndarray, np.ndarray], float] = partial( + kernel_based, use_bootstrap=False, max_num_samples_run=500 + ), + conditional_independence_test_falsify: Callable[[np.ndarray, np.ndarray, np.ndarray], float] = partial( + kernel_based, use_bootstrap=False, max_num_samples_run=500 + ), + n_jobs: Optional[int] = None, + ): + """Parameters for the causal model evaluation method. See the parameter description for more details. + + :param mechanism_evaluation_kfolds: Number of folds for evaluating the causal mechanisms. + :param baseline_models_regression: Baseline models for continuous nodes. The causal mechanisms assigned to the + nodes in the graph are compared against additive noise models with these + baseline regression models. + :param baseline_models_classification: Baseline models for categorical nodes. The causal mechanisms assigned to the + nodes in the graph are compared against these baseline models. + :param independence_test_invertible: A method for testing the independence between inputs and estimated noise of invertible + causal mechanisms. This is used to evaluate whether the made model assumptions hold. + :param significance_level_invertible: The significance level for rejecting the null hypothesis that inputs and residuals + are independent. + :param fdr_control_method_invertible: The false discovery rate control method when running multiple hypothesis tests. Note that + we can assume that the tests are independent. + :param bootstrap_runs_invertible: The independence tests are only run on a small subset of samples. This parameter + indicates how many subsets the tests should be performed on. The resulting p-values are + aggregated using a family-wise error control method. + :param max_num_permutations_falsify: Number of permutations used for falsifying the given graph structure. + :param independence_test_falsify: A method for testing the independence between two variables used for falsifying + the given graph structure. Note that the variables can be multivariate. + :param conditional_independence_test_falsify: A method for testing the independence between two variables given a + third one used for falsifying the given graph structure. Note that + the variables can be multivariate. + :param n_jobs: Number of parallel jobs. Whenever the evaluation method supports parallelization, this parameter + is used. 
+ """ + n_jobs = config.default_n_jobs if n_jobs is None else n_jobs + + if baseline_models_regression is None: + baseline_models_regression = [ + create_linear_regressor, + create_polynom_regressor, + create_hist_gradient_boost_regressor, + create_ridge_regressor, + partial(create_lasso_regressor, max_iter=10000), + create_random_forest_regressor, + create_extra_trees_regressor, + create_ada_boost_regressor, + ] + + if baseline_models_classification is None: + baseline_models_classification = [ + partial(create_logistic_regression_classifier, max_iter=10000), + partial(create_polynom_logistic_regression_classifier, max_iter=10000), + create_hist_gradient_boost_classifier, + create_random_forest_classifier, + create_gaussian_nb_classifier, + create_knn_classifier, + create_ada_boost_classifier, + ] + + self.mechanism_evaluation_kfolds = mechanism_evaluation_kfolds + self.baseline_models_regression = baseline_models_regression + self.baseline_models_classification = baseline_models_classification + self.independence_test_invertible = independence_test_invertible + self.significance_level_invertible = significance_level_invertible + self.fdr_control_method_invertible = fdr_control_method_invertible + self.bootstrap_runs_invertible = bootstrap_runs_invertible + self.max_num_permutations_falsify = max_num_permutations_falsify + self.independence_test_falsify = independence_test_falsify + self.conditional_independence_test_falsify = conditional_independence_test_falsify + self.n_jobs = n_jobs + + +@dataclass +class CausalModelEvaluationResult: + model_performances: Optional[ + Dict[Any, Tuple[bool, float, Optional[int], Optional[str], int, Optional[float]]] + ] = None + pnl_assumptions: Optional[Dict[Any, Tuple[float, str, Optional[float]]]] = None + graph_falsification: Optional[EvaluationResult] = None + overall_kl_divergence: Optional[float] = None + plot_falsification_histogram: bool = True + + def __str__(self): + summary_string = "Evaluated" + if self.model_performances is not None: + summary_string += " the performance of the causal mechanisms" + if self.pnl_assumptions is not None: + summary_string += " and the invertibility assumption of the causal mechanisms" + if self.overall_kl_divergence is not None: + summary_string += " and the overall average KL divergence between generated and observed distribution" + if self.graph_falsification is not None: + summary_string += " and graph structure" + + summary_string += ". The results are as follows:" + summary_strings = [summary_string] + + if self.model_performances is not None: + summary_strings.append("\n==== Evaluation of Causal Mechanisms ====") + summary_strings.append( + "Root nodes are evaluated based on the KL divergence between the generated " + "and the observed distribution." + ) + summary_strings.append( + "Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score " + "(CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic " + "predictions. Since the causal mechanisms produce conditional distributions, this " + "should give some insights into their performance and calibration. However, note that many algorithms " + "are still relatively robust against poor model performances." + ) + + for node in self.model_performances: + if self.model_performances[node][0]: + summary_strings.append( + "\n--- Node %s: The KL divergence between generated and observed distribution is %s." 
+ % (node, self.model_performances[node][1]) + ) + summary_strings.append(_get_kl_divergence_interpretation_string(self.model_performances[node][1])) + else: + summary_strings.append( + "\n--- Node %s: The normalized CRPS of this node is %s." + % (node, self.model_performances[node][1]) + ) + summary_strings.append(_get_crps_interpretation_string(self.model_performances[node][1])) + + if self.model_performances[node][2] is not None: + summary_strings.append( + _get_baseline_model_interpretation_string( + self.model_performances[node][2], + self.model_performances[node][4], + self.model_performances[node][3], + self.model_performances[node][5], + ) + ) + + if self.pnl_assumptions is not None: + summary_strings.append("\n==== Evaluation of Invertible Functional Causal Model Assumption ====") + if len(self.pnl_assumptions) == 0: + summary_strings.append("The causal model has no invertible causal models.") + else: + for node in self.pnl_assumptions: + rejected = "rejected" if self.pnl_assumptions[node][1] else "not rejected" + summary_strings.append( + "\n--- The model assumption for node %s is %s with a p-value of %s " + "(after potential adjustment) and a significance level of %s." + % (node, rejected, self.pnl_assumptions[node][0], self.pnl_assumptions[node][2]) + ) + if not self.pnl_assumptions[node][1]: + summary_strings.append("This implies that the model assumption might be valid.") + else: + summary_strings.append( + "This implies that the model assumption might not be valid. This is, " + "the relationship cannot be represent with this type of mechanism or " + "there is a hidden confounder between the node and its parents." + ) + + summary_strings.append( + "\nNote that these results are based on statistical independence tests, and the " + "fact that the assumption was not rejected does not necessarily imply that it " + "is correct. There is just no evidence against it." + ) + + if self.overall_kl_divergence is not None: + summary_strings.append("\n==== Evaluation of Generated Distribution ====") + summary_strings.append( + "The overall average KL divergence between the generated and observed distribution is %s" + % self.overall_kl_divergence + ) + summary_strings.append(_get_kl_divergence_interpretation_string(self.overall_kl_divergence)) + + if self.graph_falsification is not None: + summary_strings.append("\n==== Evaluation of the Causal Graph Structure ====") + summary_strings.append(str(self.graph_falsification)) + if self.plot_falsification_histogram: + plot_evaluation_results(self.graph_falsification) + + summary_strings.append("\n==== NOTE ====") + summary_strings.append( + "Always double check the made model assumptions with respect to the graph structure and " + "choice of causal mechanisms." + ) + summary_strings.append( + "All these evaluations give some insight into the goodness of the causal model, but should not be " + "overinterpreted, since some causal relationships can be intrinsically hard to model. Furthermore, many " + "algorithms are fairly robust against misspecifications or poor performances of causal mechanisms." 
+ ) + + return "\n".join(summary_strings) + + +def evaluate_causal_model( + causal_model: ProbabilisticCausalModel, + data: pd.DataFrame, + max_num_samples: int = -1, + evaluate_causal_mechanisms: bool = True, + compare_mechanism_baselines: bool = False, + evaluate_invertibility_assumptions: bool = True, + evaluate_overall_kl_divergence: bool = True, + evaluate_causal_structure: bool = True, + config: Optional[EvaluateCausalModelConfig] = None, +) -> CausalModelEvaluationResult: + """Evaluates the given causal model by running different evaluations. + + Evaluation of Causal Mechanisms: + The quality of the causal mechanisms is assessed using k-fold cross validation. This means that the models are trained + from scratch multiple times, which might take a significant amount of time for larger models. Within each fold, the models + are assessed using the (normalized) continuous ranked probability score (CRPS). The normalization is with respect to + the standard deviation of the target values. Optionally, the mechanisms are compared with baseline models to see if + they are performing significantly better or equally good. + + Evaluation of Invertible Functional Causal Model Assumption: + Invertible causal mechanisms rely on the assumption that the inputs are independent of the reconstructed noise. + This is, assuming there are no hidden confounders, the noise should be independent of parents of a node. This + can be evaluated by testing statistical independence between the reconstructed noise and the used input samples. + + Evaluation of Generated Distribution: + The distribution generated by the causal model is compared with the observed data using KL divergence. To avoid + estimating the KL divergence of high-dimensional data, we approximate it by calculating the mean KL divergence + across the individual marginal KL divergences for each node. + + Evaluation of the Causal Graph Structure: + The causal graph structure is evaluated by running a method to falsify the graph. The method involves conducting + independence tests and may consume a significant amount of time for more extensive graphs. The results provide + an indication of whether the graph is rejected or not. It's important to note that a non-rejected graph does not + guarantee its correctness. It simply means that the evaluation did not find substantial evidence to refute the causal + graph based on the provided data. However, a rejected graph might indicate potential issues with its structure. + + The outcomes of these evaluation methods should be interpreted with caution, and bad fits should not be + over-interpreted. Nonetheless, the results can offer insights into the performance of the causal model and potential + areas for improvement. + + :param causal_model: The causal model to evaluate. + :param data: The data used for the evaluation. + :param max_num_samples: The maximum number of samples used for the evaluation. If the runtime is too slow, consider + setting this to a smaller value. The default -1 indicates that all samples are used. + :param evaluate_causal_mechanisms: If True, the causal mechanisms are evaluated. + :param compare_mechanism_baselines: If True, the causal mechanisms are compared with baseline models to see + if there are model choices that perform significantly better. If False, this + comparison is skipped. This is ignored if evaluate_causal_mechanisms is False. + :param evaluate_invertibility_assumptions: If True, the model assumption represented by invertible causal mechanisms + is tested. 
+    :param evaluate_overall_kl_divergence: If True, the KL divergence between the generated and the observed data is
+                                           estimated.
+    :param evaluate_causal_structure: If True, the causal graph structure is evaluated.
+    :return: A summary of the evaluation.
+    """
+    if config is None:
+        config = EvaluateCausalModelConfig()
+
+    evaluation_result = CausalModelEvaluationResult()
+
+    if max_num_samples >= 0 and max_num_samples < data.shape[0]:
+        # Randomly subsample the data to at most max_num_samples rows.
+        data = data.iloc[np.random.choice(data.shape[0], max_num_samples, replace=False)]
+
+    if evaluate_causal_mechanisms:
+        evaluation_result.model_performances = _evaluate_model_performances(
+            causal_model,
+            data,
+            compare_mechanism_baselines,
+            config.baseline_models_regression,
+            config.baseline_models_classification,
+            config.mechanism_evaluation_kfolds,
+            config.n_jobs,
+        )
+
+    if evaluate_invertibility_assumptions:
+        evaluation_result.pnl_assumptions = _evaluate_invertibility_assumptions(
+            causal_model,
+            data,
+            config.independence_test_invertible,
+            config.significance_level_invertible,
+            config.fdr_control_method_invertible,
+            config.bootstrap_runs_invertible,
+        )
+
+    if evaluate_overall_kl_divergence:
+        # Normally, we need to estimate the KL divergence jointly. However, to avoid issues with high dimensional data,
+        # we approximate it by taking the average over the marginal KL divergences.
+        drawn_samples = draw_samples(causal_model, data.shape[0])
+
+        evaluation_result.overall_kl_divergence = np.mean(
+            [
+                auto_estimate_kl_divergence(drawn_samples[node].to_numpy(), data[node].to_numpy())
+                for node in causal_model.graph.nodes
+            ]
+        )
+
+    if evaluate_causal_structure:
+        evaluation_result.graph_falsification = falsify_graph(
+            causal_model.graph,
+            data,
+            n_permutations=config.max_num_permutations_falsify,
+            independence_test=config.independence_test_falsify,
+            conditional_independence_test=config.conditional_independence_test_falsify,
+            n_jobs=config.n_jobs,
+        )
+
+    return evaluation_result
+
+
+def _evaluate_model_performances(
+    causal_model: ProbabilisticCausalModel,
+    data: pd.DataFrame,
+    compare_mechanism_baselines,
+    baseline_models_regression: List[Callable[[], PredictionModel]],
+    baseline_models_classification: List[Callable[[], PredictionModel]],
+    kfolds: int,
+    n_jobs: int,
+) -> Dict[Any, Tuple[bool, float, Optional[int], Optional[str], int, Optional[float]]]:
+    model_performances = {}
+
+    def evaluate_node(node_name, random_seed):
+        set_random_seed(random_seed)
+
+        node_data = data[node_name].to_numpy()
+        metric_evaluations = []
+        baseline_crps = {}
+        categorical = is_categorical(node_data)
+
+        if is_root_node(causal_model.graph, node_name):
+            for training_indices, test_indices in KFold(n_splits=kfolds, shuffle=True).split(node_data):
+                tmp_causal_mechanism = causal_model.causal_mechanism(node_name).clone()
+                tmp_causal_mechanism.fit(node_data[training_indices])
+
+                metric_evaluations.append(
+                    auto_estimate_kl_divergence(
+                        tmp_causal_mechanism.draw_samples(len(test_indices)), node_data[test_indices]
+                    )
+                )
+        else:
+            parent_data = data[get_ordered_predecessors(causal_model.graph, node_name)].to_numpy()
+
+            for training_indices, test_indices in KFold(n_splits=kfolds, shuffle=True).split(parent_data):
+                tmp_causal_mechanism = causal_model.causal_mechanism(node_name).clone()
+                tmp_causal_mechanism.fit(parent_data[training_indices], node_data[training_indices])
+
+                metric_evaluations.append(
+                    crps(parent_data[test_indices], node_data[test_indices], tmp_causal_mechanism.draw_samples)
+                )
+
+                if not compare_mechanism_baselines:
+                    continue
+
+                if categorical:
+                    for baseline_mdl_factory in baseline_models_classification:
+                        tmp_classifier_mdl = baseline_mdl_factory()
+                        if (
+                            isinstance(tmp_causal_mechanism, ClassifierFCM)
+                            and isinstance(tmp_causal_mechanism.classifier_model, SklearnRegressionModel)
+                            and tmp_causal_mechanism.classifier_model.sklearn_model.__class__
+                            == tmp_classifier_mdl.sklearn_model.__class__
+                        ):
+                            # Do not compare with the same model class.
+                            continue
+
+                        baseline_mechanism = ClassifierFCM(tmp_classifier_mdl)
+                        baseline_mechanism.fit(parent_data[training_indices], node_data[training_indices])
+
+                        baseline_crps.setdefault(str(baseline_mechanism), []).append(
+                            crps(parent_data[test_indices], node_data[test_indices], baseline_mechanism.draw_samples)
+                        )
+                else:
+                    for baseline_mdl_factory in baseline_models_regression:
+                        tmp_reg_mdl = baseline_mdl_factory()
+                        if (
+                            isinstance(tmp_causal_mechanism, PostNonlinearModel)
+                            and isinstance(tmp_causal_mechanism.prediction_model, SklearnRegressionModel)
+                            and tmp_causal_mechanism.prediction_model.sklearn_model.__class__
+                            == tmp_reg_mdl.sklearn_model.__class__
+                        ):
+                            # Do not compare with the same model class.
+                            continue
+
+                        baseline_mechanism = AdditiveNoiseModel(tmp_reg_mdl)
+                        baseline_mechanism.fit(parent_data[training_indices], node_data[training_indices])
+
+                        baseline_crps.setdefault(str(baseline_mechanism), []).append(
+                            crps(parent_data[test_indices], node_data[test_indices], baseline_mechanism.draw_samples)
+                        )
+
+        if len(metric_evaluations) == 0:
+            mean_metric = None
+        else:
+            mean_metric = float(np.mean(metric_evaluations))
+
+        count_better_performance = None
+        best_baseline_performance = None
+        best_baseline_model = None
+
+        if compare_mechanism_baselines:
+            count_better_performance = 0
+
+            for k in baseline_crps:
+                baseline_crps[k] = float(np.mean(baseline_crps[k]))
+
+                # Count baselines that outperform the assigned mechanism by a margin of 0.05 in average CRPS.
+                if mean_metric - baseline_crps[k] > 0.05:
+                    count_better_performance += 1
+
+                if best_baseline_performance is None or baseline_crps[k] < best_baseline_performance:
+                    best_baseline_model = k
+                    best_baseline_performance = baseline_crps[k]
+
+        return (
+            node_name,
+            is_root_node(causal_model.graph, node_name),
+            mean_metric,
+            count_better_performance,
+            best_baseline_model,
+            len(baseline_crps),
+            best_baseline_performance,
+        )
+
+    random_seeds = np.random.randint(np.iinfo(np.int32).max, size=len(causal_model.graph.nodes))
+    all_results = Parallel(n_jobs=n_jobs)(
+        delayed(evaluate_node)(node, int(random_seeds[i]))
+        for i, node in enumerate(
+            tqdm(
+                list(nx.topological_sort(causal_model.graph)),
+                position=0,
+                leave=True,
+                # Note: config here refers to the module-level dowhy.gcm config, not EvaluateCausalModelConfig.
+                disable=not config.show_progress_bars,
+                desc="Evaluating causal mechanisms...",
+            )
+        )
+    )
+
+    for (
+        node_name,
+        root_node,
+        mean_metric,
+        count_better_performance,
+        best_baseline_model,
+        num_baselines,
+        best_baseline_performance,
+    ) in all_results:
+        model_performances[node_name] = (
+            root_node,
+            mean_metric,
+            count_better_performance,
+            best_baseline_model,
+            num_baselines,
+            best_baseline_performance,
+        )
+
+    return model_performances
+
+
+def _evaluate_invertibility_assumptions(
+    causal_model: ProbabilisticCausalModel,
+    data: pd.DataFrame,
+    independence_test_invertible: Callable[[np.ndarray, np.ndarray], float],
+    significance_level_invertible: float,
+    fdr_control_method_invertible: Optional[str],
+    bootstrap_runs_invertible: int,
+    max_num_samples_per_run: int = 2000,
+) -> Dict[Any, Tuple[float, bool, float]]:
+    all_pnl_p_values = {}
+
+    if max_num_samples_per_run >= data.shape[0]:
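+        # Without subsampling, every bootstrap run would test the identical data, so a single run suffices.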
+        bootstrap_runs_invertible = 1
+
+    max_num_samples_per_run = min(max_num_samples_per_run, data.shape[0])
+
+    for node in causal_model.graph.nodes:
+        causal_mechanism = causal_model.causal_mechanism(node)
+        if isinstance(causal_mechanism, InvertibleFunctionalCausalModel):
+            parent_samples = data[get_ordered_predecessors(causal_model.graph, node)].to_numpy()
+            target_samples = data[node].to_numpy()
+
+            # Test whether the reconstructed noise is independent of the parents, as implied by an invertible FCM.
+            tmp_p_values = []
+            for run in range(bootstrap_runs_invertible):
+                random_indices = np.random.choice(target_samples.shape[0], max_num_samples_per_run, replace=False)
+                tmp_p_values.append(
+                    independence_test_invertible(
+                        causal_mechanism.estimate_noise(target_samples[random_indices], parent_samples[random_indices]),
+                        parent_samples[random_indices],
+                    )
+                )
+            all_pnl_p_values[node] = quantile_based_fwer(tmp_p_values)
+
+    if len(all_pnl_p_values) == 0:
+        return all_pnl_p_values
+
+    if fdr_control_method_invertible is not None:
+        to_adjust = []
+        node_names = []
+        for node in all_pnl_p_values:
+            node_names.append(node)
+            to_adjust.append(all_pnl_p_values[node])
+
+        multiple_test_result = multipletests(
+            to_adjust, significance_level_invertible, method=fdr_control_method_invertible
+        )
+
+        for i, node in enumerate(node_names):
+            all_pnl_p_values[node] = (
+                multiple_test_result[1][i],
+                multiple_test_result[0][i],
+                significance_level_invertible,
+            )
+    else:
+        for node in all_pnl_p_values:
+            all_pnl_p_values[node] = (
+                all_pnl_p_values[node],
+                all_pnl_p_values[node] < significance_level_invertible,
+                significance_level_invertible,
+            )
+
+    return all_pnl_p_values
+
+
+def _estimate_conditional_expectations(
+    causal_mechanism: ConditionalStochasticModel,
+    parent_samples: np.ndarray,
+    categorical: bool,
+    num_samples_conditional_samples: int,
+) -> np.ndarray:
+    if isinstance(causal_mechanism, PostNonlinearModel):
+        # In case of post non-linear models, we can obtain the conditional expectation directly from the prediction
+        # model. To do this, we can simply pass 0 as the noise, since this evaluates Y = f(X) + 0 in case of an
+        # additive noise model and Y = g(f(X) + 0) in case of a more general model.
+        return causal_mechanism.evaluate(parent_samples, np.zeros(parent_samples.shape[0])).reshape(-1)
+    elif isinstance(causal_mechanism, ClassifierFCM):
+        return causal_mechanism.classifier_model.predict(parent_samples).reshape(-1)
+    else:
+        if not categorical:
+            # Estimate the conditional expectation E[Y | x] by generating multiple samples for Y | x and averaging them.
+            y_preds = np.zeros(parent_samples.shape[0])
+            for _ in range(num_samples_conditional_samples):
+                y_preds += causal_mechanism.draw_samples(parent_samples).reshape(-1)
+
+            return y_preds / num_samples_conditional_samples
+        else:
+            # Since these are categorical values, we just need to look for the most frequent element after drawing
+            # multiple samples for each input.
+            all_draws = []
+            for _ in range(num_samples_conditional_samples):
+                all_draws.append(causal_mechanism.draw_samples(parent_samples).reshape(-1))
+
+            modes, _ = mode(np.array(all_draws), axis=0)
+
+            return np.array(modes[0].tolist())
+
+
+def nrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
+    """Estimates the Normalized Root Mean Squared Error (NRMSE) based on the given samples. That is, the root mean
+    squared error normalized by the standard deviation of the observed values.
+
+    :param y_true: Observed values.
+    :param y_pred: Predicted values.
+    :return: The normalized RMSE.
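+
+    For intuition: predictions identical to the observed values give an NRMSE of 0, while constantly predicting the
+    mean of the observed values gives an NRMSE of 1, since the RMSE then equals the standard deviation.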
+ """ + y_true = y_true.reshape(-1) + y_pred = y_pred.reshape(-1) + + return mean_squared_error(y_true, y_pred, squared=False) / np.std(y_true) + + +def crps( + X: np.ndarray, + Y: np.ndarray, + conditional_sampling_method: Callable[[np.ndarray], np.ndarray], + num_conditional_samples: int = 100, + normalize: bool = True, +) -> float: + """Estimates the (normalized) Continuous Ranked Probability Score (CRPS) based on the given data and generation + process. This is used to check the calibration of a probabilistic prediction. + + :param X: Observations of the input features. + :param Y: Observations of the corresponding target value. + :param conditional_sampling_method: Method to sample from the conditional given an input sample from X. + :param num_conditional_samples: Number of samples that should be drawn from the conditional to estimate the CRPS. + :param normalize: If True, the target values are normalized in the continuous case by the standard deviation of the + expected Y values. By this, the CRPS become comparable across different scales. + :return: The Continuous Ranked Probability Score. + """ + + def empirical_crps(generated_Y, observed_y): + """Estimates \int (F(x) - 1_{x >= y})**2 dx = E[|X - y|] - 1/2 E[|X - X'|] + + Here, X is generated_Y and y is observed_y. The X' are another set of generated_Y, however, we can also take + the difference over all combinations to estimate the expectation.""" + generated_Y = generated_Y.reshape(-1) + observed_y = observed_y.squeeze() + + return np.mean(np.abs(generated_Y - observed_y)) - 0.5 * np.mean( + np.abs(np.subtract.outer(generated_Y, generated_Y)) + ) + + if Y.ndim > 1 and Y.shape[1] > 1: + raise ValueError("Y has to be one dimensional!") + + X = shape_into_2d(X) + Y = Y.reshape(-1) + + crps_values = [] + + categorical = is_categorical(Y) + if categorical: + # In the categorical case, this is equivalent to the Brier score. However, the following formulation allows + # categorical data with more than two classes. + all_classes = np.unique(Y) + + for x, y in zip(X, Y): + samples = conditional_sampling_method(repmat(x, num_conditional_samples, 1)) + + sample_categorical_crps = [] + for cat in all_classes: + sample_categorical_crps.append( + empirical_crps((samples == cat).astype(int), np.array(y == cat).astype(int)) + ) + crps_values.append(np.mean(sample_categorical_crps)) + else: + std_Y = 1 + + if normalize: + std_Y = np.std(Y) + + if std_Y == 0: + std_Y = 1 + + for x, y in zip(X, Y): + crps_values.append( + empirical_crps(conditional_sampling_method(repmat(x, num_conditional_samples, 1)) / std_Y, y / std_Y) + ) + + return float(np.mean(np.array(crps_values))) + + +def _get_kl_divergence_interpretation_string(kl_divergence: float) -> str: + if kl_divergence < 0.5: + return "The estimated KL divergence indicates an overall very good representation of the data distribution." + elif 0.5 <= kl_divergence < 1: + return ( + "The estimated KL divergence indicates a good representation of the data distribution, but might " + "indicate some smaller mismatches between the distributions." + ) + elif 1 <= kl_divergence < 1.5: + return "The estimated KL divergence indicates some mismatches between the distributions." + elif 1.5 <= kl_divergence < 3: + return "The estimated KL divergence indicates some significant mismatches between the distributions." + else: + return ( + "The estimated KL divergence indicates a significant mismatches between the distributions. " + "Consider using models that better fit the distribution. 
+            "choice, intrinsically weak connections or very small signal to noise ratios will always lead to a poor "
+            "fit."
+        )
+
+
+def _get_crps_interpretation_string(crps: float) -> str:
+    if crps < 0.2:
+        return "The estimated CRPS indicates a very good model performance."
+    elif 0.2 <= crps < 0.35:
+        return "The estimated CRPS indicates a good model performance."
+    elif 0.35 <= crps < 0.65:
+        return (
+            "The estimated CRPS indicates a fair model performance. "
+            "Note, however, that a high CRPS could also result from a small signal to noise ratio."
+        )
+    elif 0.65 <= crps < 0.9:
+        return (
+            "The estimated CRPS indicates a rather poor model performance. Note, however, that a high CRPS could also "
+            "result from a very small signal to noise ratio."
+        )
+    elif 0.9 <= crps < 1:
+        return (
+            "The estimated CRPS indicates a bad model performance. Consider trying "
+            "alternative causal mechanism types. Note, however, that a high CRPS could also "
+            "result from a very small signal to noise ratio."
+        )
+    else:
+        return (
+            "The estimated CRPS indicates a very bad model performance. Consider trying alternative causal mechanism "
+            "types."
+        )
+
+
+def _get_baseline_model_interpretation_string(
+    count_better: int, count_total: int, best_baseline_model: str, best_baseline_performance: float
+) -> str:
+    percentage_better = count_better / count_total
+    best_baseline_model = best_baseline_model.replace("()", "")
+
+    if percentage_better == 0:
+        return "The mechanism is better or equally good than all %s baseline mechanisms." % str(count_total)
+    elif percentage_better < 0.2:
+        return (
+            str(count_better)
+            + " of "
+            + str(count_total)
+            + " baseline mechanisms performed significantly better. The best baseline mechanism is: "
+            + best_baseline_model
+            + " with a CRPS of "
+            + str(best_baseline_performance)
+            + "."
+        )
+    elif 0.2 <= percentage_better < 0.4:
+        return (
+            str(count_better)
+            + " of "
+            + str(count_total)
+            + " baseline mechanisms performed significantly better. The best baseline "
+            "mechanism is: "
+            + best_baseline_model
+            + " with a CRPS of "
+            + str(best_baseline_performance)
+            + ". Accordingly, the current mechanism could be improved using a "
+            + best_baseline_model
+            + " instead. If you are using an auto assignment, consider a "
+            "better assignment quality level such as AssignmentQuality.BETTER."
+        )
+    else:
+        return (
+            str(count_better)
+            + " of "
+            + str(count_total)
+            + " baseline mechanisms performed significantly better. The best baseline "
+            "mechanism is: "
+            + best_baseline_model
+            + " with a CRPS of "
+            + str(best_baseline_performance)
+            + ". Seeing this, consider changing the mechanism to a "
+            + best_baseline_model
+            + " or, if you are using an auto assignment, consider a better assignment quality "
+            "level such as AssignmentQuality.BETTER."
+        )
diff --git a/tests/gcm/test_auto.py b/tests/gcm/test_auto.py
index cc4c8fcc63..db7c18e6bf 100644
--- a/tests/gcm/test_auto.py
+++ b/tests/gcm/test_auto.py
@@ -344,16 +344,22 @@ def test_given_continuous_data_when_print_auto_summary_then_returns_expected_for
     assert len(summary_result._nodes["Y"]["model_performances"]) > 0
     expected_summary = """Analyzed 6 nodes.
+--- Node: X0
 Node X0 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution.
+--- Node: X1
 Node X1 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution.
+--- Node: X2
 Node X2 is a root node. 
Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: X3 Node X3 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: X4 Node X4 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: Y Node Y is a non-root node. Assigning 'AdditiveNoiseModel using HistGradientBoostingRegressor' to the node. This represents the causal relationship as Y := f(X0,X1,X2,X3,X4) + N. @@ -403,16 +409,22 @@ def test_given_categorical_data_when_print_auto_summary_then_returns_expected_fo assert len(summary_result._nodes["Y"]["model_performances"]) > 0 expected_summary = """Analyzed 6 nodes. + --- Node: X0 Node X0 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: X1 Node X1 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: X2 Node X2 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: X3 Node X3 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: X4 Node X4 is a root node. Assigning 'Empirical Distribution' to the node representing the marginal distribution. + --- Node: Y Node Y is a non-root node. Assigning 'Classifier FCM based on LogisticRegression(max_iter=10000)' to the node. This represents the causal relationship as Y := f(X0,X1,X2,X3,X4,N). diff --git a/tests/gcm/test_model_evaluation.py b/tests/gcm/test_model_evaluation.py new file mode 100644 index 0000000000..2e7b56bbba --- /dev/null +++ b/tests/gcm/test_model_evaluation.py @@ -0,0 +1,309 @@ +import networkx as nx +import numpy as np +import pandas as pd +from _pytest.python_api import approx +from flaky import flaky +from scipy import stats + +from dowhy.gcm import ( + AdditiveNoiseModel, + ClassifierFCM, + InvertibleStructuralCausalModel, + ScipyDistribution, + fit, + kernel_based, +) +from dowhy.gcm.auto import assign_causal_mechanisms +from dowhy.gcm.ml import ( + create_hist_gradient_boost_classifier, + create_hist_gradient_boost_regressor, + create_linear_regressor, + create_linear_regressor_with_given_parameters, + create_logistic_regression_classifier, +) +from dowhy.gcm.model_evaluation import ( + EvaluateCausalModelConfig, + _estimate_conditional_expectations, + _evaluate_invertibility_assumptions, + crps, + evaluate_causal_model, + nrmse, +) + + +def test_given_good_fit_when_estimate_nrmse_then_returns_zero(): + X = np.random.normal(0, 1, 1000) + Y = 2 * X + + mdl = AdditiveNoiseModel( + create_linear_regressor_with_given_parameters(np.array([2]), intercept=0), + noise_model=ScipyDistribution(stats.norm, loc=0, scale=0), + ) + + assert nrmse(_estimate_conditional_expectations(mdl, X, False, 1), Y) == approx(0, abs=0.01) + + +def test_given_bad_fit_when_estimate_nrmse_then_returns_high_value(): + X = np.random.normal(0, 1, 1000) + Y = 2 * X + + mdl = AdditiveNoiseModel( + create_linear_regressor_with_given_parameters(np.array([20]), intercept=0), + noise_model=ScipyDistribution(stats.norm, loc=0, scale=0), + ) + + assert nrmse(Y, _estimate_conditional_expectations(mdl, X, False, 1)) > 1 + + +def test_given_good_fit_but_noisy_data_when_estimate_nrmse_then_returns_expected_result(): + X = np.random.normal(0, 1, 2000) + Y = 2 * X + np.random.normal(0, 1, 2000) + + mdl = AdditiveNoiseModel( + 
create_linear_regressor_with_given_parameters(np.array([2]), intercept=0), + noise_model=ScipyDistribution(stats.norm, loc=0, scale=1), + ) + + # The RMSE should be 1 due to the variance of the noise. The NRMSE is accordingly 1 / std(Y). + assert nrmse(Y, _estimate_conditional_expectations(mdl, X, False, 1)) == approx(1 / np.std(Y), abs=0.05) + + +def test_given_good_fit_with_deterministic_data_when_estimate_crps_then_returns_zero(): + X = np.random.normal(0, 1, 1000) + Y = 2 * X + + mdl = AdditiveNoiseModel( + create_linear_regressor_with_given_parameters(np.array([2]), intercept=0), + noise_model=ScipyDistribution(stats.norm, loc=0, scale=0), + ) + + assert crps(X, Y, mdl.draw_samples) == approx(0, abs=0.01) + + +def test_given_bad_fit_with_deterministic_data_when_estimate_crps_then_returns_expected_result(): + X = np.random.normal(0, 1, 2000) + Y = X + + mdl = AdditiveNoiseModel( + create_linear_regressor_with_given_parameters(np.array([1]), intercept=0), + noise_model=ScipyDistribution(stats.norm, loc=0, scale=2), + ) + + assert crps(X, Y, mdl.draw_samples) == approx(0.47, abs=0.05) + + +def test_given_good_fit_but_noisy_data_when_estimate_crps_then_returns_expected_result(): + X = np.random.normal(0, 1, 2000) + Y = 2 * X + np.random.normal(0, 1, 2000) + + mdl = AdditiveNoiseModel( + create_linear_regressor_with_given_parameters(np.array([2]), intercept=0), + noise_model=ScipyDistribution(stats.norm, loc=0, scale=1), + ) + + assert crps(X, Y, mdl.draw_samples) == approx(0.26, abs=0.05) + + +def test_given_very_bad_fit_with_deterministic_data_when_estimate_crps_then_returns_expected_result(): + X = np.random.normal(0, 1, 2000) + Y = X + + mdl = AdditiveNoiseModel( + create_linear_regressor_with_given_parameters(np.array([100]), intercept=0), + noise_model=ScipyDistribution(stats.norm, loc=0, scale=2), + ) + + assert crps(X, Y, mdl.draw_samples) > 1 + + +def test_given_categorical_data_and_a_good_fit_with_deterministic_data_when_estimate_crps_then_returns_zero(): + X = np.random.normal(0, 1, 1000) + Y = (X > 0).astype(str) + + mdl = ClassifierFCM(create_logistic_regression_classifier()) + mdl.fit(X, Y) + + X = np.random.normal(0, 1, 1000) + Y = (X > 0).astype(str) + + assert crps(X, Y, mdl.draw_samples) == approx(0.02, abs=0.01) + + +def test_given_categorical_data_and_a_bad_fit_with_deterministic_data_when_estimate_crps_then_returns_expected_result(): + X = np.random.normal(0, 1, 1000) + Y = (X > 0).astype(str) + + mdl = ClassifierFCM(create_logistic_regression_classifier()) + mdl.fit(X, Y) + + X = np.random.normal(0, 1, 1000) + Y = ((X + 1) > 0).astype(str) + + assert crps(X, Y, mdl.draw_samples) == approx(0.3, abs=0.05) + + +@flaky(max_runs=3) +def test_given_multiplicative_noise_data_when_evaluate_invertibility_assumptions_then_rejects(): + X0 = np.random.normal(0, 1, 5000) + Y = X0 * np.random.normal(0, 0.1, 5000) + data = pd.DataFrame({"X0": X0, "Y": Y}) + + causal_model = InvertibleStructuralCausalModel(nx.DiGraph([("X0", "Y")])) + + assign_causal_mechanisms(causal_model, data) + causal_model.set_causal_mechanism("Y", AdditiveNoiseModel(create_linear_regressor())) + fit(causal_model, data) + + assert _evaluate_invertibility_assumptions(causal_model, data, kernel_based, 0.05, None, 1)["Y"][1] + + +@flaky(max_runs=3) +def test_given_additive_noise_data_when_evaluate_invertibility_assumptions_then_does_not_reject(): + X0 = np.random.normal(0, 1, 5000) + Y = X0 + np.random.normal(0, 0.1, 5000) + data = pd.DataFrame({"X0": X0, "Y": Y}) + + causal_model = 
InvertibleStructuralCausalModel(nx.DiGraph([("X0", "Y")])) + + assign_causal_mechanisms(causal_model, data) + causal_model.set_causal_mechanism("Y", AdditiveNoiseModel(create_linear_regressor())) + fit(causal_model, data) + + assert not _evaluate_invertibility_assumptions(causal_model, data, kernel_based, 0.05, None, 1)["Y"][1] + + +@flaky(max_runs=3) +def test_given_continuous_data_only_when_evaluate_model_returns_expected_information(): + X0 = np.random.normal(0, 1, 1000) + X1 = np.random.normal(0, 1, 1000) + Y = X0 + X1 + np.random.normal(0, 0.1, 1000) + + data = pd.DataFrame({"X0": X0, "X1": X1, "Y": Y}) + + causal_model = InvertibleStructuralCausalModel(nx.DiGraph([("X0", "Y"), ("X1", "Y")])) + + assign_causal_mechanisms(causal_model, data) + fit(causal_model, data) + + summary = evaluate_causal_model( + causal_model, + data, + compare_mechanism_baselines=True, + config=EvaluateCausalModelConfig( + baseline_models_regression=[create_linear_regressor, create_hist_gradient_boost_regressor], + ), + ) + + assert summary.overall_kl_divergence == approx(0, abs=0.05) + assert summary.model_performances["X0"][1] == approx(0, abs=0.15) + assert summary.model_performances["X1"][1] == approx(0, abs=0.15) + assert summary.model_performances["Y"][1] == approx(0.05, abs=0.02) # CRPS + assert not summary.pnl_assumptions["Y"][1] + assert summary.pnl_assumptions["Y"][2] == 0.05 + + summary.plot_falsification_histogram = False + + summary_string = str(summary) + + assert ( + """Evaluated the performance of the causal mechanisms and the invertibility assumption of the causal mechanisms and the overall average KL divergence between generated and observed distribution and graph structure. The results are as follows: + +==== Evaluation of Causal Mechanisms ==== +Root nodes are evaluated based on the KL divergence between the generated and the observed distribution. +Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances. + +--- Node X0: The KL divergence between generated and observed distribution is """ + in summary_string + ) + assert "--- Node Y: The normalized CRPS of this node is " in summary_string + assert "The estimated CRPS indicates a very good model performance." in summary_string + + assert "The mechanism is better or equally good than all " in summary_string + assert "==== Evaluation of Invertible Functional Causal Model Assumption ====" in summary_string + assert ( + "Note that these results are based on statistical independence tests, and the fact that the assumption was " + "not rejected does not necessarily imply that it is correct. There is just no evidence against it." + in summary_string + ) + + assert "==== Evaluation of Generated Distribution ====" in summary_string + assert ( + "The estimated KL divergence indicates an overall very good representation of the data distribution" + in summary_string + ) + assert "==== Evaluation of the Causal Graph Structure ====" in summary_string + assert ( + """==== NOTE ==== +Always double check the made model assumptions with respect to the graph structure and choice of causal mechanisms. 
+All these evaluations give some insight into the goodness of the causal model, but should not be overinterpreted, since some causal relationships can be intrinsically hard to model. Furthermore, many algorithms are fairly robust against misspecifications or poor performances of causal mechanisms.""" + in summary_string + ) + + +@flaky(max_runs=3) +def test_given_categorical_data_only_when_evaluate_model_returns_expected_information(): + X0 = np.random.normal(0, 1, 2000) + X1 = np.random.normal(0, 1, 2000) + Y = (X0 + X1 + np.random.normal(0, 0.1, 2000) > 0).astype(str) + + data = pd.DataFrame({"X0": X0, "X1": X1, "Y": Y}) + + causal_model = InvertibleStructuralCausalModel(nx.DiGraph([("X0", "Y"), ("X1", "Y")])) + + assign_causal_mechanisms(causal_model, data) + fit(causal_model, data) + + summary = evaluate_causal_model( + causal_model, + data, + compare_mechanism_baselines=True, + config=EvaluateCausalModelConfig( + baseline_models_classification=[ + create_logistic_regression_classifier, + create_hist_gradient_boost_classifier, + ], + ), + ) + assert summary.overall_kl_divergence == approx(0, abs=0.05) + assert summary.model_performances["X0"][1] == approx(0, abs=0.15) + assert summary.model_performances["X1"][1] == approx(0, abs=0.15) + assert summary.model_performances["Y"][1] == approx(0.02, abs=0.05) # CRPS + assert len(summary.pnl_assumptions) == 0 + + summary.plot_falsification_histogram = False + + summary_string = str(summary) + + assert ( + """Evaluated the performance of the causal mechanisms and the invertibility assumption of the causal mechanisms and the overall average KL divergence between generated and observed distribution and graph structure. The results are as follows: + +==== Evaluation of Causal Mechanisms ==== +Root nodes are evaluated based on the KL divergence between the generated and the observed distribution. +Non-root nodes are evaluated based on the (normalized) Continuous Ranked Probability Score (CRPS), which is a generalizes the Mean Absolute Percentage Error to probabilistic predictions. Since the causal mechanisms produce conditional distributions, this should give some insights into their performance and calibration. However, note that many algorithms are still relatively robust against poor model performances. + +--- Node X0: The KL divergence between generated and observed distribution is """ + in summary_string + ) + assert "--- Node Y: The normalized CRPS of this node is " in summary_string + assert "The estimated CRPS indicates a very good model performance." in summary_string + assert "The mechanism is better or equally good than all " in summary_string + assert "==== Evaluation of Invertible Functional Causal Model Assumption ====" in summary_string + assert "The causal model has no invertible causal models." in summary_string + assert ( + ( + """==== Evaluation of Generated Distribution ==== +The overall average KL divergence between the generated and observed distribution is""" + ) + in summary_string + ) + assert ( + "The estimated KL divergence indicates an overall very good representation of the data distribution" + in summary_string + ) + assert "==== Evaluation of the Causal Graph Structure ====" in summary_string + assert ( + """==== NOTE ==== +Always double check the made model assumptions with respect to the graph structure and choice of causal mechanisms. +All these evaluations give some insight into the goodness of the causal model, but should not be overinterpreted, since some causal relationships can be intrinsically hard to model. 
Furthermore, many algorithms are fairly robust against misspecifications or poor performances of causal mechanisms.""" + in summary_string + )