Showing 2 changed files with 24 additions and 140 deletions.
151 changes: 18 additions & 133 deletions
docs/tutorials/performance_estimation/multiclass_performance_estimation.rst
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,144 +1,29 @@ | ||
.. _multiclass-performance-estimation: | ||
|
||
==================================================== | ||
================================================ | ||
Estimating Performance for Multiclass Classification | ||
==================================================== | ||
================================================ | ||
|
||
This tutorial explains how to use NannyML to estimate the performance of multiclass classification | ||
models in the absence of target data. To find out how :class:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE` estimates performance, read the :ref:`explanation of Confidence-based | ||
Performance Estimation<performance-estimation-deep-dive>`. | ||
We currently support the following **standard** metrics for multiclass classification performance estimation: | ||
|
||
.. note:: | ||
The following example uses :term:`timestamps<Timestamp>`. | ||
These are optional but have an impact on the way data is chunked and results are plotted. | ||
You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`. | ||
* **roc_auc** - one-vs-the-rest, macro-averaged | ||
* **f1** - macro-averaged | ||
* **precision** - macro-averaged | ||
* **recall** - macro-averaged | ||
* **specificity** - macro-averaged | ||
* **accuracy** | ||
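To make "macro-averaged" concrete, here is a minimal sketch in plain Python. It is not NannyML code, and the class labels and helper names are hypothetical. Each metric is computed per class in a one-vs-the-rest fashion, then averaged with equal weight per class, regardless of how many rows each class has.

```python
from typing import Dict, List, Tuple


def per_class_precision_recall(
    y_true: List[str], y_pred: List[str], labels: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Compute one-vs-the-rest precision and recall for every class label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[label] = (precision, recall)
    return scores


def macro_average(values: List[float]) -> float:
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


# Hypothetical labels and predictions, for illustration only.
labels = ["a", "b", "c"]
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c", "c"]

scores = per_class_precision_recall(y_true, y_pred, labels)
macro_precision = macro_average([scores[label][0] for label in labels])
macro_recall = macro_average([scores[label][1] for label in labels])
```

Because every class gets the same weight, a poorly served minority class lowers the macro average as much as a poorly served majority class would.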
|
||
For more information about estimating these metrics, refer to the :ref:`standard-metric-estimation` section. | ||
|
||
Just The Code | ||
------------- | ||
We also support the following *complex* metrics for multiclass classification performance estimation: | ||
|
||
.. nbimport:: | ||
:path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb | ||
:cells: 1 3 4 6 | ||
* **confusion_matrix** | ||
|
||
.. admonition:: **Advanced configuration** | ||
:class: hint | ||
For more information about estimating the confusion matrix for multiclass problems, | ||
refer to the :ref:`multiclass-confusion-matrix-estimation` section. | ||
|
||
- To learn how :class:`~nannyml.chunk.Chunk` works and how to set up custom chunking, check out the :ref:`chunking tutorial <chunking>` | ||
- To learn how :class:`~nannyml.thresholds.ConstantThreshold` works and how to set up custom thresholds, check out the :ref:`thresholds tutorial <thresholds>` | ||
.. toctree:: | ||
:maxdepth: 2 | ||
|
||
Walkthrough | ||
----------- | ||
|
||
|
||
For simplicity this guide is based on a synthetic dataset where the monitored model predicts | ||
which type of credit card product new customers should be assigned to. | ||
Check out :ref:`Credit Card Dataset<dataset-synthetic-multiclass>` to learn more about this dataset. | ||
|
||
In order to monitor a model, NannyML needs to learn about it and set expectations from a reference dataset. | ||
Then it can monitor the data that is subject to actual analysis, provided as the analysis dataset. | ||
You can read more about this in our section on :ref:`data periods<data-drift-periods>`. | ||
|
||
.. nbimport:: | ||
:path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb | ||
:cells: 1 | ||
|
||
.. nbtable:: | ||
:path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb | ||
:cell: 2 | ||
|
||
Next we create the Confidence-based Performance Estimation (:class:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE`) | ||
estimator with a list of metrics, and an optional :term:`chunking<Data Chunk>` specification. For more information about | ||
chunking, check out the :ref:`chunking tutorial<chunking>` and its :ref:`advanced guide<chunk-data>`. | ||
|
||
.. note:: | ||
The list of metrics specifies which performance metrics of the monitored model will be estimated. | ||
The following metrics are currently supported: | ||
|
||
- ``roc_auc`` - one-vs-the-rest, macro-averaged | ||
- ``f1`` - macro-averaged | ||
- ``precision`` - macro-averaged | ||
- ``recall`` - macro-averaged | ||
- ``specificity`` - macro-averaged | ||
- ``accuracy`` | ||
|
||
|
||
.. nbimport:: | ||
:path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb | ||
:cells: 3 | ||
|
||
The :class:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE` | ||
estimator is then fitted using the | ||
:meth:`~nannyml.performance_estimation.confidence_based.cbpe.CBPE.fit` method on the reference data. | ||
|
||
The fitted ``estimator`` can be used to estimate performance on other data, for which performance cannot be calculated. | ||
Typically, this would be used on the latest production data where target is missing. In our example this is | ||
the ``analysis_df`` data. | ||
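The intuition behind estimating performance without targets can be sketched for accuracy. This is an illustration of the general confidence-based idea and not NannyML's implementation. It assumes the predicted probabilities are well calibrated, and the function name and sample rows are hypothetical. If the model assigns probability ``p`` to the class it predicts, that prediction is expected to be correct with probability ``p``, so the expected accuracy of a chunk is the mean of those probabilities.

```python
from typing import Dict, List


def estimate_accuracy(rows: List[Dict[str, float]]) -> float:
    """Estimate accuracy from calibrated class probabilities, without targets.

    Each row maps a class label to its predicted probability. The predicted
    class is the most probable one, and its probability is the expected
    chance that the prediction is correct.
    """
    total = 0.0
    for row in rows:
        predicted_class = max(row, key=row.get)
        total += row[predicted_class]
    return total / len(rows)


# Hypothetical chunk of three predictions over classes "a", "b" and "c".
chunk = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.1, "b": 0.5, "c": 0.4},
    {"a": 0.2, "b": 0.2, "c": 0.6},
]
estimated_accuracy = estimate_accuracy(chunk)  # (0.7 + 0.5 + 0.6) / 3
```

The estimate is only as good as the calibration of the probabilities, which is one reason the estimator is fitted on reference data first.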
|
||
NannyML can then output a dataframe that contains all the results. Let's have a look at the results for the analysis period | ||
only. | ||
|
||
.. nbimport:: | ||
:path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb | ||
:cells: 4 | ||
|
||
.. nbtable:: | ||
:path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb | ||
:cell: 5 | ||
|
||
Apart from chunk-related data, the results data have the following columns for each metric | ||
that was estimated: | ||
|
||
- **value** - the estimate of a metric for a specific chunk. | ||
- **sampling_error** - the estimate of the :term:`Sampling Error`. | ||
- **realized** - when **target** values are available for a chunk, the realized performance metric will also | ||
be calculated and included within the results. | ||
- **upper_confidence_boundary** and **lower_confidence_boundary** - These values show the :term:`Confidence Band` of the relevant metric | ||
and are equal to the estimated value +/- 3 times the estimated :term:`Sampling Error`. | ||
- **upper_threshold** and **lower_threshold** - crossing these thresholds will raise an alert on significant | ||
performance change. The thresholds are calculated based on the actual performance of the monitored model on chunks in | ||
the reference partition. By default, the thresholds are 3 standard deviations away from the mean performance calculated on | ||
chunks. They are calculated during the ``fit`` phase. You can also set up custom constant or standard deviation thresholds; | ||
to learn more, check out our :ref:`tutorial on thresholds<thresholds>`. | ||
- **alert** - a flag indicating a potentially significant performance change. ``True`` if the estimated performance crosses | ||
the upper or lower threshold. | ||
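The relationships between these columns can be sketched in plain Python. This is a minimal illustration of the rules described above, not NannyML's code. The helper names and numbers are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def confidence_band(value: float, sampling_error: float) -> Tuple[float, float]:
    """Lower and upper confidence boundary: value -/+ 3 * sampling error."""
    return value - 3 * sampling_error, value + 3 * sampling_error


def default_thresholds(reference_values: List[float]) -> Tuple[float, float]:
    """Lower and upper threshold: mean -/+ 3 standard deviations of the
    per-chunk performance on the reference data (population std assumed)."""
    center = mean(reference_values)
    spread = pstdev(reference_values)
    return center - 3 * spread, center + 3 * spread


def is_alert(value: float, lower: float, upper: float) -> bool:
    """An alert is raised when the estimate falls outside the thresholds."""
    return value < lower or value > upper


# Hypothetical per-chunk performance on the reference period.
reference_performance = [0.75, 0.76, 0.74, 0.75, 0.75]
lower_threshold, upper_threshold = default_thresholds(reference_performance)

# Hypothetical estimate and sampling error for one analysis chunk.
band = confidence_band(value=0.72, sampling_error=0.005)
alert = is_alert(0.72, lower_threshold, upper_threshold)
```

Note that the confidence band depends on the sampling error of each chunk, while the thresholds are fixed once from the reference data.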
|
||
|
||
These results can also be plotted. Our plot contains several key elements. | ||
|
||
* The purple dashed step plot shows the estimated performance in each chunk of the analysis period. Thick square point | ||
markers indicate the middle of these chunks. | ||
|
||
* The black vertical line splits the reference and analysis periods. | ||
|
||
* The low-saturated colored area around the estimated performance indicates the :ref:`sampling error<estimation_of_standard_error>`. | ||
|
||
* The red horizontal dashed lines show upper and lower thresholds for alerting purposes. | ||
|
||
* If the estimated performance crosses the upper or lower threshold, an alert is raised, which is indicated with a red | ||
diamond-shaped point marker in the middle of the chunk. | ||
|
||
The description of the tabular results above explains how the | ||
:term:`confidence bands<Confidence Band>` and thresholds are calculated. Additional information is shown on hover (these are | ||
interactive plots, though only static views are included here). | ||
|
||
|
||
.. nbimport:: | ||
:path: ./example_notebooks/Tutorial - Estimating Performance - Multiclass Classification.ipynb | ||
:cells: 6 | ||
|
||
.. image:: ../../_static/tutorials/performance_estimation/multiclass_synthetic.svg | ||
|
||
Insights | ||
-------- | ||
|
||
After reviewing the performance estimation results, we should be able to see any indications of performance change that | ||
NannyML has detected based upon the model's inputs and outputs alone. | ||
|
||
|
||
What's next | ||
----------- | ||
|
||
The :ref:`Data Drift<data-drift>` functionality can help us to understand whether data drift is causing the performance problem. | ||
When the target values become available, we can | ||
:ref:`compare realized and estimated performance results<compare_estimated_and_realized_performance>`. | ||
multiclass_performance_estimation/standard_metric_estimation | ||
multiclass_performance_estimation/confusion_matrix_estimation |