diff --git a/docs/tutorials/performance_estimation/multiclass_performance_estimation.rst b/docs/tutorials/performance_estimation/multiclass_performance_estimation.rst
index 49e575b0..e87136bc 100644
--- a/docs/tutorials/performance_estimation/multiclass_performance_estimation.rst
+++ b/docs/tutorials/performance_estimation/multiclass_performance_estimation.rst
@@ -1,144 +1,29 @@
 .. _multiclass-performance-estimation:
 
-====================================================
+====================================================
 Estimating Performance for Multiclass Classification
-====================================================
+====================================================
 
-This tutorial explains how to use NannyML to estimate the performance of binary classification
-models in the absence of target data. To find out how :class:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE` estimates performance, read the :ref:`explanation of Confidence-based
-Performance Estimation<performance-estimation-deep-dive>`.
+We currently support the following **standard** metrics for multiclass classification performance estimation:
 
-.. note::
-    The following example uses :term:`timestamps<Timestamp>`.
-    These are optional but have an impact on the way data is chunked and results are plotted.
-    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+    * **roc_auc** - one-vs-the-rest, macro-averaged
+    * **f1** - macro-averaged
+    * **precision** - macro-averaged
+    * **recall** - macro-averaged
+    * **specificity** - macro-averaged
+    * **accuracy**
 
+For more information about estimating these metrics, refer to the :ref:`multiclass_standard-metric-estimation` section.
 
-Just The Code
--------------
+We also support the following **complex** metrics for multiclass classification performance estimation:
 
-.. nbimport::
-    :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cells: 1 3 4 6
+    * **confusion_matrix**
 
-.. admonition:: **Advanced configuration**
-    :class: hint
+For more information about estimating the confusion matrix for multiclass problems,
+refer to the :ref:`multiclass-confusion-matrix-estimation` section.
 
-    - To learn how :class:`~nannyml.chunk.Chunk` works and to set up custom chunkings check out the :ref:`chunking tutorial <chunking>`
-    - To learn how :class:`~nannyml.thresholds.ConstantThreshold` works and to set up custom threshold check out the :ref:`thresholds tutorial <thresholds>`
+.. toctree::
+   :maxdepth: 2
 
-Walkthrough
------------
-
-
-For simplicity this guide is based on a synthetic dataset where the monitored model predicts
-which type of credit card product new customers should be assigned to.
-Check out :ref:`Credit Card Dataset<dataset-synthetic-multiclass>` to learn more about this dataset.
-
-In order to monitor a model, NannyML needs to learn about it and set expectations from a reference dataset.
-Then it can monitor the data that is subject to actual analysis, provided as the analysis dataset.
-You can read more about this in our section on :ref:`data periods<data-drift-periods>`.
-
-.. nbimport::
-    :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cells: 1
-
-.. nbtable::
-    :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cell: 2
-
-Next we create the Confidence-based Performance Estimation (:class:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE`)
-estimator with a list of metrics, and an optional :term:`chunking<Data Chunk>` specification. For more information about
-chunking check out the :ref:`chunking tutorial<chunking>` and it's :ref:`advanced guide<chunk-data>`.
-
-.. note::
-  The list of metrics specifies which performance metrics of the monitored model will be estimated.
-  The following metrics are currently supported:
-
-  - ``roc_auc`` - one-vs-the-rest, macro-averaged
-  - ``f1`` - macro-averaged
-  - ``precision`` - macro-averaged
-  - ``recall`` - macro-averaged
-  - ``specificity`` - macro-averaged
-  - ``accuracy``
-
-
-.. nbimport::
-    :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cells: 3
-
-The :class:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE`
-estimator is then fitted using the
-:meth:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE.fit` method on the reference data.
-
-The fitted ``estimator`` can be used to estimate performance on other data, for which performance cannot be calculated.
-Typically, this would be used on the latest production data where target is missing. In our example this is
-the ``analysis_df`` data.
-
-NannyML can then output a dataframe that contains all the results. Let's have a look at the results for analysis period
-only.
-
-.. nbimport::
-    :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cells: 4
-
-.. nbtable::
-    :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cell: 5
-
-Apart from chunk-related data, the results data have the following columns for each metric
-that was estimated:
-
- - **value** - the estimate of a metric for a specific chunk.
- - **sampling_error** - the estimate of the :term:`Sampling Error`.
- - **realized** - when **target** values are available for a chunk, the realized performance metric will also
-   be calculated and included within the results.
- - **upper_confidence_boundary** and **lower_confidence_boundary** - These values show the :term:`Confidence Band` of the relevant metric
-   and are equal to estimated value +/- 3 times the estimated :term:`Sampling Error`.
- - **upper_threshold** and **lower_threshold** - crossing these thresholds will raise an alert on significant
-   performance change. The thresholds are calculated based on the actual performance of the monitored model on chunks in
-   the reference partition. By default, the thresholds are 3 standard deviations away from the mean performance calculated on
-   chunks. They are calculated during ``fit`` phase. You can also set up custom thresholds using constant or standard deviations thresholds,
-   to learn more about it check out our :ref:`tutorial on thresholds<thresholds>`.
- - **alert** - flag indicating potentially significant performance change. ``True`` if estimated performance crosses
-   upper or lower threshold.
-
-
-These results can be also plotted. Our plot contains several key elements.
-
-* The purple dashed step plot shows the estimated performance in each chunk of the analysis period. Thick squared point
-  markers indicate the middle of these chunks.
-
-* The black vertical line splits the reference and analysis periods.
-
-* The low-saturated colored area around the estimated performance indicates the :ref:`sampling error<estimation_of_standard_error>`.
-
-* The red horizontal dashed lines show upper and lower thresholds for alerting purposes.
-
-* If the estimated performance crosses the upper or lower threshold an alert is raised which is indicated with a red
-  diamond-shaped point marker in the middle of the chunk.
-
-Description of tabular results above explains how the
-:term:`confidence bands<Confidence Band>` and thresholds are calculated. Additional information is shown in the hover (these are
-interactive plots, though only static views are included here).
-
-
-.. nbimport::
-    :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cells: 6
-
-.. image:: ../../_static/tutorials/performance_estimation/multiclass_synthetic.svg
-
-Insights
---------
-
-After reviewing the performance estimation results, we should be able to see any indications of performance change that
-NannyML has detected based upon the model's inputs and outputs alone.
-
-
-What's next
------------
-
-The :ref:`Data Drift<data-drift>` functionality can help us to understand whether data drift is causing the performance problem.
-When the target values become available we can
-:ref:`compared realized and performance results<compare_estimated_and_realized_performance>`.
+   multiclass_performance_estimation/standard_metric_estimation
+   multiclass_performance_estimation/confusion_matrix_estimation
diff --git a/docs/tutorials/performance_estimation/multiclass_performance_estimation/standard_metric_estimation.rst b/docs/tutorials/performance_estimation/multiclass_performance_estimation/standard_metric_estimation.rst
index beccf876..75ae505f 100644
--- a/docs/tutorials/performance_estimation/multiclass_performance_estimation/standard_metric_estimation.rst
+++ b/docs/tutorials/performance_estimation/multiclass_performance_estimation/standard_metric_estimation.rst
@@ -1,8 +1,8 @@
 .. _multiclass_standard-metric-estimation:
 
-=====================================================================
-Estimating Standard Performance Metrics for Multiclass Classification
-=====================================================================
+====================================================
+Estimating Performance for Multiclass Classification
+====================================================
 
 This tutorial explains how to use NannyML to estimate the performance of binary classification
 models in the absence of target data. To find out how :class:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE` estimates performance, read the :ref:`explanation of Confidence-based
@@ -19,7 +19,7 @@ Just The Code
 
 .. nbimport::
     :path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb
-    :cells: 1 3 4 6 8
+    :cells: 1 3 4 6
 
 .. admonition:: **Advanced configuration**
     :class: hint
@@ -140,6 +140,5 @@ What's next
 -----------
 
 The :ref:`Data Drift<data-drift>` functionality can help us to understand whether data drift is causing the performance problem.
-When the target values become available we can use
-:ref:`realized performance calculation<multiclass-standard-metric-calculation>` to
-:ref:`compare realized and estimated confusion matrix results<compare_estimated_and_realized_performance>`.
+When the target values become available we can
+:ref:`compare realized and estimated performance results<compare_estimated_and_realized_performance>`.