From 8728d77236734c6ed41c2895a5b416043530060a Mon Sep 17 00:00:00 2001
From: Nikolaos Perrakis
Date: Mon, 22 Jan 2024 19:51:15 +0200
Subject: [PATCH] update docs

---
 docs/how_it_works/multivariate_drift.rst | 45 +++++++++++--------
 .../multivariate_drift_detection/cdd.rst | 10 ++---
 2 files changed, 31 insertions(+), 24 deletions(-)

diff --git a/docs/how_it_works/multivariate_drift.rst b/docs/how_it_works/multivariate_drift.rst
index 1e8ae61e..fb090264 100644
--- a/docs/how_it_works/multivariate_drift.rst
+++ b/docs/how_it_works/multivariate_drift.rst
@@ -154,26 +154,35 @@ tutorial.
 Classifier for Drift Detection
 ------------------------------
 
-Classifier for drift detection is an implementation of domain classifiers, as it is called
-in `relevant literature`_. NannyML uses a LightGBM classifier to distinguish between
-the reference data and the examined chunk data. Similar to data reconstruction with PCA
-this method is also able to capture complex changes in our data. The algorithm implementing
-Classifier for Drift Detection follows the steps described below.
+Classifier for drift detection provides a measure of how easy it is to discriminate
+the reference data from the examined chunk data. It is an implementation of domain classifiers, as
+they are called in `relevant literature`_, using a LightGBM classifier.
+As a measure of discrimination performance, NannyML uses the cross-validated AUROC score.
+Similar to data reconstruction with PCA, this method is also able to capture complex changes in our data.
+The algorithm implementing Classifier for Drift Detection follows the steps described below.
 Please note that the process described below is repeated for each :term:`Data Chunk`.
 
-First, we prepare the data by assigning label 0 to reference data and label 1 to chunk data.
-We use the model inputs as features and concatenate the reference and chunk data.
-Duplicate rows are removed once, keeping the one coming from the chunk data.
-This ensures that when we estimate on reference data, we get meaningful results.
-Finally, categorical data are encoded as integers, since this works well with LightGBM.
-
-To evaluate the domain classifier's discrimination performance, we use its cross-validated AUROC score.
-We follow these steps to do so: First, we optionally perform hyperparameter tuning.
-We perform hyperparameter optimization once on the combined data and store the resulting optimal hyperparameters.
-Users can also provide hyperparameters. If nothing is specified, LightGBM defaults are used.
-Next, we use sklearn's `StratifiedKFold` to split the data. For each fold split,
-we train an `LGBMClassifier` and save its predicted score in the validation fold.
-Finally, we use the predictions across all folds to calculate the resulting AUROC score
+The process consists of two basic parts: data preprocessing and classifier cross validation.
+
+The data preprocessing part consists of the following steps:
+
+- Assign label 0 to reference data rows and label 1 to chunk data rows.
+- Use the model inputs as features.
+- Concatenate the resulting data.
+- Remove duplicate rows once. We keep the rows coming from chunk data in
+  order to get meaningful results when we use the method on reference data chunks.
+- Encode categorical data as integers for better compatibility with LightGBM.
+
+The classifier cross validation part consists of the following steps:
+
+- Hyperparameter tuning. This step is optional. It uses the dataset created in the previous
+  part and stores the resulting optimal hyperparameters.
+- If hyperparameter tuning is not requested, user-specified hyperparameters can be used
+  instead of the default LightGBM options.
+- sklearn's `StratifiedKFold` is used to split the data into folds.
+- For each validation fold split, NannyML trains an `LGBMClassifier` and saves its predicted
+  score in the validation fold.
+- The predictions across all folds are used to calculate the resulting AUROC score.
 
 The higher the AUROC score the easier it is to distinguish the datasets, hence the more different they are.
 
diff --git a/docs/tutorials/detecting_data_drift/multivariate_drift_detection/cdd.rst b/docs/tutorials/detecting_data_drift/multivariate_drift_detection/cdd.rst
index 54a9bfb1..d902473f 100644
--- a/docs/tutorials/detecting_data_drift/multivariate_drift_detection/cdd.rst
+++ b/docs/tutorials/detecting_data_drift/multivariate_drift_detection/cdd.rst
@@ -5,12 +5,10 @@ Classifier for Drift Detection
 ==============================
 
 The second multivariate drift detection method of NannyML is Classifier for Drift Detection.
-This method trains a classification model to differentiate between data from the reference
-dataset and the chunk dataset. Cross Validation is used for training.
-The discriminator's performance, measured by AUROC, on the cross valdated folds is
-the multivariate drift measure. When there is no data drift the datasets
-can't discerned and we get a value of 0.5. The more drift there is, the higher
-the returned measure will be.
+It provides a measure of how easy it is to discriminate the reference data from the examined chunk data.
+You can read more about it in the :ref:`How it works: Classifier for Drift Detection` section.
+When there is no data drift, the datasets can't be discerned and we get a value of 0.5.
+The more drift there is, the higher the returned measure will be, up to a value of 1.
 
 Just The Code
 -------------
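
The procedure described in the updated `multivariate_drift.rst` text (label, concatenate, deduplicate keeping chunk rows, stratified folds, out-of-fold scores, one AUROC) can be sketched roughly as below. This is only an illustration under stated assumptions, not NannyML's implementation: a nearest-centroid scorer stands in for the `LGBMClassifier` so that the sketch runs on the standard library alone, features are assumed to be numeric (no categorical encoding), hyperparameter tuning is skipped, and all function names (`prepare`, `stratified_folds`, `cv_auroc`, `auroc`) are hypothetical.

```python
import random
from statistics import mean


def auroc(labels, scores):
    """AUROC as the chance that a label-1 row outscores a label-0 row (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prepare(reference, chunk):
    """Label reference rows 0 and chunk rows 1, concatenate, and drop duplicates.

    A row present in both datasets keeps the chunk label (1).
    """
    labelled = {}
    for row in reference:
        labelled[tuple(row)] = 0
    for row in chunk:
        labelled[tuple(row)] = 1  # later assignment wins, so the chunk copy is kept
    rows = list(labelled)
    return rows, [labelled[r] for r in rows]


def stratified_folds(labels, k, seed=0):
    """Split row indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def centroid(rows):
    return [mean(col) for col in zip(*rows)]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cv_auroc(reference, chunk, k=5, seed=0):
    """Cross-validated AUROC of a stand-in domain classifier (nearest centroid)."""
    rows, labels = prepare(reference, chunk)
    scores = [0.0] * len(rows)
    for fold in stratified_folds(labels, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(rows)) if i not in held_out]
        c0 = centroid([rows[i] for i in train if labels[i] == 0])
        c1 = centroid([rows[i] for i in train if labels[i] == 1])
        for i in fold:
            # Higher score means the row looks more like chunk data.
            scores[i] = sqdist(rows[i], c0) - sqdist(rows[i], c1)
    # One AUROC over the out-of-fold scores of all rows.
    return auroc(labels, scores)


# Synthetic data, made up for illustration only.
rng = random.Random(42)
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
same = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
shifted = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(300)]

no_drift = cv_auroc(reference, same)    # expected to land near 0.5
drift = cv_auroc(reference, shifted)    # expected to land near 1
```

Scoring every row only with a model that never saw it during training is what keeps the no-drift value close to 0.5 rather than inflated by overfitting.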