
Commit 2ef024e

nnansters and Niels Nuyttens authored
Feature/optional timestamp (#121)
* Adding chunk index to calculation / estimation results
* Adapted step plots
* Adapted joy plots
* Adapted stacked bar plots
* Added tests
* Update quickstart docs to not use timestamps
* Add tests for runner.py
* Clean up temp dirs after testing
* Move chunk date assignment to inheriting classes (each chunker should decide to support dates or not)
* Display "unified" chunk index (taking reference data into account) in hover
* Updated docs warning about using / not using timestamps in the corresponding code samples
* Updated CHANGELOG.md
* Bump version: 0.6.1 → 0.6.2
* Hopefully fix weird build error due to random rounding??
* Fix linting issue

Co-authored-by: Niels Nuyttens <niels.nuyttens@nannyml.com>
1 parent 159e19f commit 2ef024e

71 files changed: 4190 additions and 365 deletions

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.6.1
+current_version = 0.6.2
 commit = True
 tag = True

CHANGELOG.md

Lines changed: 18 additions & 0 deletions
@@ -4,6 +4,24 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.6.2] - 2022-09-16
+
+### Changed
+
+- Made the `timestamp_column_name` required by all calculators and estimators optional. The main consequences of this
+are plots have a chunk-index based x-axis now when no timestamp column name was given. You can also not chunk by
+period when the timestamp column name is not specified.
+
+### Fixed
+
+- Added missing `s3fs` dependency
+- Fixed outdated plotting kind constants in the runner (used by CLI)
+- Fixed some missing images and incorrect version numbers in the README, thanks [@NeoKish](https://github.com/NeoKish)!
+
+### Added
+
+- Added a lot of additional tests, mainly concerning plotting and the [`Runner`](nannyml/runner.py) class
+
 ## [0.6.1] - 2022-09-09
 
 ### Changed
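
The 0.6.2 changelog entry says results can be laid out by chunk index when no timestamp column is given. As a rough illustration only, the sketch below shows why size-based chunking needs nothing but row positions. It is not NannyML's chunker: `chunk_by_size` and the dictionary keys are hypothetical names chosen for this example, and the handling of a shorter final chunk is an assumption.

```python
from typing import Dict, List, Sequence


def chunk_by_size(rows: Sequence, chunk_size: int) -> List[Dict[str, int]]:
    """Split rows into consecutive chunks identified by position only (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    chunks = []
    for chunk_index, start in enumerate(range(0, len(rows), chunk_size)):
        # Assumption of this sketch: the final chunk may hold fewer rows than chunk_size.
        end = min(start + chunk_size, len(rows)) - 1
        chunks.append({"chunk_index": chunk_index, "start_index": start, "end_index": end})
    return chunks


# Ten rows and a chunk size of four give three chunks, none of which needs a timestamp.
chunks = chunk_by_size(list(range(10)), chunk_size=4)
```

Period-based chunking, by contrast, has to read a date from every row, which is why the changelog rules it out when no timestamp column name is specified.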

README.md

Lines changed: 2 additions & 2 deletions
@@ -69,15 +69,15 @@ Allowing you to have the following benefits:
 | 🔬 **[Technical reference]** | Monitor the performance of your ML models. |
 | 🔎 **[Blog]** | Thoughts on post-deployment data science from the NannyML team. |
 | 📬 **[Newsletter]** | All things post-deployment data science. Subscribe to see the latest papers and blogs. |
-| 💎 **[New in v0.6.1]** | New features, bug fixes. |
+| 💎 **[New in v0.6.2]** | New features, bug fixes. |
 | 🧑‍💻 **[Contribute]** | How to contribute to the NannyML project and codebase. |
 | <img src="https://raw.githubusercontent.com/NannyML/nannyml/main/media/slack.png" height='15'> **[Join slack]** | Need help with your specific use case? Say hi on slack! |
 
 [NannyML 101]: https://nannyml.readthedocs.io/en/stable/
 [Performance Estimation]: https://nannyml.readthedocs.io/en/stable/how_it_works/performance_estimation.html
 [Key Concepts]: https://nannyml.readthedocs.io/en/stable/glossary.html
 [Technical Reference]:https://nannyml.readthedocs.io/en/stable/nannyml/modules.html
-[New in v0.6.1]: https://github.com/NannyML/nannyml/releases/latest/
+[New in v0.6.2]: https://github.com/NannyML/nannyml/releases/latest/
 [Real World Example]: https://nannyml.readthedocs.io/en/stable/examples/california_housing.html
 [Blog]: https://www.nannyml.com/blog
 [Newsletter]: https://mailchi.mp/022c62281d13/postdeploymentnewsletter

docs/_static/quick-start-drift-distance_from_office.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-drift-gas_price_per_litre.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-drift-multivariate.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-drift-public_transportation_cost.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-drift-salary_range.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-drift-tenure.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-drift-wfh_prev_workday.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-drift-workday.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-perf-est.svg

Lines changed: 1 addition & 1 deletion

docs/_static/quick-start-score-drift.svg

Lines changed: 1 addition & 1 deletion

docs/example_notebooks/Quickstart.ipynb

Lines changed: 4 additions & 4 deletions
@@ -391,7 +391,6 @@
 " y_pred_proba='y_pred_proba',\n",
 " y_pred='y_pred',\n",
 " y_true='work_home_actual',\n",
-" timestamp_column_name='timestamp',\n",
 " metrics=['roc_auc'],\n",
 " chunk_size=chunk_size,\n",
 " problem_type='classification_binary',\n",
@@ -427,7 +426,6 @@
 "# Let's initialize the object that will perform the Univariate Drift calculations\n",
 "univariate_calculator = nml.UnivariateStatisticalDriftCalculator(\n",
 " feature_column_names=feature_column_names,\n",
-" timestamp_column_name='timestamp',\n",
 " chunk_size=chunk_size\n",
 ")\n",
 "univariate_calculator = univariate_calculator.fit(reference)\n",
@@ -600,7 +598,6 @@
 "calc = nml.StatisticalOutputDriftCalculator(\n",
 " y_pred='y_pred',\n",
 " y_pred_proba='y_pred_proba',\n",
-" timestamp_column_name='timestamp',\n",
 " problem_type='classification_binary'\n",
 ")\n",
 "calc.fit(reference)\n",
@@ -626,7 +623,10 @@
 "outputs": [],
 "source": [
 "# Let's initialize the object that will perform Data Reconstruction with PCA\n",
-"rcerror_calculator = nml.DataReconstructionDriftCalculator(feature_column_names=feature_column_names, timestamp_column_name='timestamp', chunk_size=chunk_size).fit(reference_data=reference)\n",
+"rcerror_calculator = nml.DataReconstructionDriftCalculator(\n",
+" feature_column_names=feature_column_names,\n",
+" chunk_size=chunk_size\n",
+").fit(reference_data=reference)\n",
 "# let's see Reconstruction error statistics for all available data\n",
 "rcerror_results = rcerror_calculator.calculate(analysis)\n",
 "figure = rcerror_results.plot(kind='drift', plot_reference=True)\n",

docs/quick.rst

Lines changed: 6 additions & 0 deletions
@@ -42,6 +42,12 @@ concepts and functionalities. If you want to know what is implemented under the
 visit :ref:`how it works<how_it_works>`. Finally, if you just look for examples
 on other datasets or ML problems look through our :ref:`examples<examples>`.
 
+.. note::
+    The following example does not use any :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 -------------
 Just the code

docs/tutorials/data_requirements.rst

Lines changed: 21 additions & 2 deletions
@@ -109,6 +109,8 @@ Below we see the columns our dataset contains and explain their purpose.
 +----+------------------------+----------------+-----------------------+------------------------------+--------------------+-----------+----------+
 
 
+.. _data_requirements_columns_timestamp:
+
 Timestamp
 ^^^^^^^^^
 
@@ -124,7 +126,24 @@ In the sample data this is the ``timestamp`` column.
 - *ISO 8601*, e.g. ``2021-10-13T08:47:23Z``
 - *Unix-epoch* in units of seconds, e.g. ``1513393355``
 
-Currently required for all features of NannyML, though we are looking to drop this requirement in a future release.
+
+.. warning::
+    This column is optional. When a timestamp column is not provided, plots will no longer make use of a time based x-axis
+    but will use the index of the chunks instead. The following plots illustrate this:
+
+    .. figure:: /_static/drift-guide-salary_range.svg
+
+        Plot using a time based X-axis
+
+
+    .. figure:: /_static/quick-start-drift-salary_range.svg
+
+        Plot using an index based X-axis
+
+
+    Some :class:`~nannyml.chunk.Chunker` classes might require the presence of a timestamp, such as the
+    :class:`~nannyml.chunk.PeriodBasedChunker`.
+
 
 Target
 ^^^^^^
@@ -183,7 +202,7 @@ You can see those requirements in the table below:
 +--------------+-------------------------------------+-------------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+
 | Data | Performance Estimation | Realized Performance | Univariate Feature Drift | Multivariate Feature Drift | Target Drift | Output Drift |
 +==============+=====================================+=====================================+===================================+===================================+===================================+===================================+
-| timestamp | Required (reference and analysis) | Required (reference and analysis) | Required (reference and analysis) | Required (reference and analysis) | Required (reference and analysis) | Required (reference and analysis) |
+| timestamp | | | | | | |
 +--------------+-------------------------------------+-------------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+
 | features | | | Required (reference and analysis) | Required (reference and analysis) | | |
 +--------------+-------------------------------------+-------------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+
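
The warning added to the data requirements states that plots fall back to the chunk index on the x-axis when no timestamp column is provided. A minimal sketch of that decision follows. It is an illustration under assumptions, not NannyML's plotting code: `x_axis_values` and the `chunk_index` / `start_date` keys are hypothetical names.

```python
from datetime import datetime
from typing import List, Optional, Union


def x_axis_values(
    chunks: List[dict], timestamp_column_name: Optional[str] = None
) -> List[Union[int, datetime]]:
    """Pick x-axis values for a plot (hypothetical helper, not part of NannyML)."""
    if timestamp_column_name is None:
        # No timestamps available: fall back to the position of each chunk.
        return [chunk["chunk_index"] for chunk in chunks]
    # Timestamps available: use the date on which each chunk starts.
    return [chunk["start_date"] for chunk in chunks]


example_chunks = [
    {"chunk_index": 0, "start_date": datetime(2021, 1, 1)},
    {"chunk_index": 1, "start_date": datetime(2021, 2, 1)},
]
index_axis = x_axis_values(example_chunks)
time_axis = x_axis_values(example_chunks, timestamp_column_name="timestamp")
```

Keeping the choice in one place means every plot type (step, joy, stacked bar) can share the same fallback.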

docs/tutorials/detecting_data_drift/model_outputs/drift_detection_for_binary_classification_model_outputs.rst

Lines changed: 6 additions & 0 deletions
@@ -13,6 +13,12 @@ If the model's population changes, then its actions will be different.
 The difference in actions is very important to know as soon as possible because
 they directly affect the business results from operating a machine learning model.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 Just The Code
 ------------------------------------

docs/tutorials/detecting_data_drift/model_outputs/drift_detection_for_multiclass_classification_model_outputs.rst

Lines changed: 6 additions & 0 deletions
@@ -13,6 +13,12 @@ If the model's population changes, then our populations' actions will be differe
 The difference in actions is very important to know as soon as possible because
 they directly affect the business results from operating a machine learning model.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 Just The Code
 ------------------------------------

docs/tutorials/detecting_data_drift/model_outputs/drift_detection_for_regression_model_outputs.rst

Lines changed: 6 additions & 0 deletions
@@ -13,6 +13,12 @@ If the model's population changes, then the outcome will be different.
 The difference in actions is very important to know as soon as possible because
 they directly affect the business results from operating a machine learning model.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 Just The Code
 -------------

docs/tutorials/detecting_data_drift/model_targets/drift_detection_for_binary_classification_model_targets.rst

Lines changed: 6 additions & 0 deletions
@@ -23,6 +23,12 @@ of the available target values for each chunk, for both binary and multiclass cl
 .. note::
     The Target Drift detection process can handle missing target values across all :term:`data periods<Data Period>`.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 Just The Code
 ------------------------------------

docs/tutorials/detecting_data_drift/model_targets/drift_detection_for_multiclass_classification_model_targets.rst

Lines changed: 6 additions & 0 deletions
@@ -23,6 +23,12 @@ of the available target values for each chunk, for both binary and multiclass cl
 .. note::
     The Target Drift detection process can handle missing target values across all :term:`data periods<Data Period>`.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 Just The Code
 ------------------------------------

docs/tutorials/detecting_data_drift/model_targets/drift_detection_for_regression_model_targets.rst

Lines changed: 6 additions & 0 deletions
@@ -21,6 +21,12 @@ but also show the target distribution results per chunk with joyploys.
 .. note::
     The Target Drift detection process can handle missing target values across all :term:`data periods<Data Period>`.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 Just The Code
 -------------

docs/tutorials/performance_calculation/binary_performance_calculation.rst

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,12 @@
 Monitoring Realized Performance for Binary Classification
 ================================================================
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 Just The Code
 ==============

docs/tutorials/performance_calculation/multiclass_performance_calculation.rst

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,12 @@
 Monitoring Realized Performance for Multiclass Classification
 ================================================================
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 Just The Code
 ==============

docs/tutorials/performance_calculation/regression_performance_calculation.rst

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,12 @@
 Monitoring Realized Performance for Regression
 ==============================================
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 Just The Code
 =============

docs/tutorials/performance_estimation/binary_performance_estimation.rst

Lines changed: 6 additions & 0 deletions
@@ -8,6 +8,12 @@ This tutorial explains how to use NannyML to estimate the performance of binary
 models in the absence of target data. To find out how CBPE estimates performance, read the :ref:`explanation of Confidence-based
 Performance Estimation<performance-estimation-deep-dive>`.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 
 .. _performance-estimation-binary-just-the-code:

docs/tutorials/performance_estimation/multiclass_performance_estimation.rst

Lines changed: 6 additions & 0 deletions
@@ -8,6 +8,12 @@ This tutorial explains how to use NannyML to estimate the performance of multicl
 models in the absence of target data. To find out how CBPE estimates performance, read the :ref:`explanation of Confidence-based
 Performance Estimation<performance-estimation-deep-dive>`.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 Just The Code
 -------------

docs/tutorials/performance_estimation/regression_performance_estimation.rst

Lines changed: 6 additions & 0 deletions
@@ -8,6 +8,12 @@ This tutorial explains how to use NannyML to estimate the performance of regress
 models in the absence of target data. To find out how DLE estimates performance,
 read the :ref:`explanation of how Direct Loss Estimation works<how-it-works-dle>`.
 
+.. note::
+    The following example uses :term:`timestamps<Timestamp>`.
+    These are optional but have an impact on the way data is chunked and results are plotted.
+    You can read more about them in the :ref:`data requirements<data_requirements_columns_timestamp>`.
+
+
 .. _performance-estimation-regression-just-the-code:
 
 Just The Code

nannyml/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@
 # Dev branch marker is: 'X.Y.dev' or 'X.Y.devN' where N is an integer.
 # 'X.Y.dev0' is the canonical version of 'X.Y.dev'
 #
-__version__ = '0.6.1'
+__version__ = '0.6.2'
 
 import logging

nannyml/base.py

Lines changed: 11 additions & 2 deletions
@@ -66,6 +66,7 @@ def __init__(
         chunk_number: int = None,
         chunk_period: str = None,
         chunker: Chunker = None,
+        timestamp_column_name: Optional[str] = None,
     ):
         """Creates a new instance of an abstract DriftCalculator.
 
@@ -83,7 +84,11 @@ def __init__(
         chunker : Chunker
             The `Chunker` used to split the data sets into a lists of chunks.
         """
-        self.chunker = ChunkerFactory.get_chunker(chunk_size, chunk_number, chunk_period, chunker)
+        self.chunker = ChunkerFactory.get_chunker(
+            chunk_size, chunk_number, chunk_period, chunker, timestamp_column_name
+        )
+
+        self.timestamp_column_name = timestamp_column_name
 
     @property
     def _logger(self) -> logging.Logger:
@@ -167,6 +172,7 @@ def __init__(
         chunk_number: int = None,
         chunk_period: str = None,
         chunker: Chunker = None,
+        timestamp_column_name: str = None,
     ):
         """Creates a new instance of an abstract DriftCalculator.
 
@@ -184,7 +190,10 @@ def __init__(
         chunker : Chunker
             The `Chunker` used to split the data sets into a lists of chunks.
         """
-        self.chunker = ChunkerFactory.get_chunker(chunk_size, chunk_number, chunk_period, chunker)
+        self.chunker = ChunkerFactory.get_chunker(
+            chunk_size, chunk_number, chunk_period, chunker, timestamp_column_name
+        )
+        self.timestamp_column_name = timestamp_column_name
 
     @property
     def _logger(self) -> logging.Logger:
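
The hunks above pass `timestamp_column_name` through to `ChunkerFactory.get_chunker`, whose body is not part of this excerpt. The sketch below is one way such a factory could guard period-based chunking; it is an assumption for illustration, not the library's implementation. `select_chunking`, `MissingTimestampError`, the returned strings and the precedence order are all hypothetical.

```python
from typing import Optional


class MissingTimestampError(ValueError):
    """Hypothetical error for period chunking requested without a timestamp column."""


def select_chunking(
    chunk_size: Optional[int] = None,
    chunk_number: Optional[int] = None,
    chunk_period: Optional[str] = None,
    timestamp_column_name: Optional[str] = None,
) -> str:
    """Describe which chunking strategy would apply (illustrative stand-in for a factory)."""
    if chunk_period is not None:
        # Periods are derived from dates, so a timestamp column is mandatory here.
        if timestamp_column_name is None:
            raise MissingTimestampError(
                "chunk_period requires timestamp_column_name to be specified"
            )
        return f"period:{chunk_period}"
    if chunk_size is not None:
        return f"size:{chunk_size}"
    if chunk_number is not None:
        return f"count:{chunk_number}"
    # What happens when nothing is specified is left open in this sketch.
    return "default"
```

Failing early in the factory keeps the calculators and estimators themselves free of timestamp checks, which matches the commit note that each chunker should decide whether it supports dates.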
