:orphan:

.. currentmodule:: dirty_cat
- :func:`fuzzy_join` and :class:`FeatureAugmenter` now perform joins on missing values, as pandas.merge does,
  but raise a warning. :pr:`522` and :pr:`529` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Improved date column detection and date format inference in :class:`TableVectorizer`. The
  format inference now finds a format that works for all non-missing values of the column, instead of relying on pandas behavior. If no such format exists, the column is not cast to a date column. :pr:`543` by :user:`Leo Grinsztajn <LeoGrin>`
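The inference rule described above (pick a format that parses every non-missing value, otherwise leave the column alone) can be pictured with a minimal standard-library sketch. This is not the :class:`TableVectorizer` implementation; the function name ``infer_common_format`` and the candidate list ``CANDIDATE_FORMATS`` are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical candidate formats, for illustration only.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S")


def infer_common_format(values, formats=CANDIDATE_FORMATS):
    """Return the first format that parses every non-missing value, else None."""
    # Missing entries (None or empty strings) are ignored during inference.
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return None
    for fmt in formats:
        try:
            for value in present:
                datetime.strptime(value, fmt)
        except ValueError:
            # At least one value does not fit this format: try the next one.
            continue
        return fmt
    # No single format fits all values: the column should not be cast.
    return None


print(infer_common_format(["2021-03-04", None, "2022-12-31"]))
print(infer_common_format(["13/02/2020", "01/03/2020"]))
print(infer_common_format(["2021-03-04", "not a date"]))
```

When the function returns ``None``, the caller would keep the column as it is rather than cast it to dates.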
- SuperVectorizer is renamed to :class:`TableVectorizer`; a warning is raised when using the old name.
  :pr:`484` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based on string similarities, and the nearest-neighbor matches are found for each category. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
- New experimental feature: :class:`FeatureAugmenter`, a transformer that uses :func:`fuzzy_join` to add features to a main table from auxiliary tables. :pr:`409` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`
- The :class:`MinHashEncoder` now supports an n_jobs parameter to parallelize the computation of hashes. :pr:`267` by :user:`Leo Grinsztajn <LeoGrin>` and :user:`Lilian Boulard <LilianBoulard>`.
- New experimental feature: deduplicating misspelled categories using :func:`deduplicate` by clustering string distances. This function works best when there are significantly more duplicates than underlying categories. :pr:`339` by :user:`Moritz Boos <mjboos>`.
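The idea of approximate key matching behind :func:`fuzzy_join`, mentioned in the entries above, can be illustrated with a small standard-library sketch. It does not call :func:`fuzzy_join` and does not reproduce its similarity measure: ``difflib``'s ratio is swapped in as a stand-in for string similarity, and the helper name ``fuzzy_match_keys`` and the ``cutoff`` value are assumptions made for this example.

```python
import difflib


def fuzzy_match_keys(left_keys, right_keys, cutoff=0.6):
    """Map each left key to its most similar right key, or None below the cutoff."""
    matches = {}
    for key in left_keys:
        # get_close_matches returns at most n candidates, best match first.
        best = difflib.get_close_matches(key, right_keys, n=1, cutoff=cutoff)
        matches[key] = best[0] if best else None
    return matches


left = ["Londn", "Pariss", "Tokyo"]
right = ["London", "Paris", "Madrid"]
print(fuzzy_match_keys(left, right))
```

Misspelled keys are paired with their closest counterpart, while a key with no sufficiently similar counterpart is left unmatched.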
- Add example Wikipedia embeddings to enrich the data. :pr:`487` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- datasets.fetching: contains a new function :func:`get_ken_embeddings` that can be used to download Wikipedia
  embeddings and filter them by type.
- datasets.fetching: contains a new function :func:`fetch_world_bank_indicator` that can be used to download indicators from the World Bank Open Data platform. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Removed example Fitting scalable, non-linear models on data with dirty categories. :pr:`386` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :class:`MinHashEncoder`'s :func:`minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- Fetching functions now have an additional argument ``directory``, which can be used to specify where datasets are saved to and loaded from. :pr:`432` and :pr:`453` by :user:`Lilian Boulard <LilianBoulard>`
- The :class:`TableVectorizer`'s default OneHotEncoder for low cardinality categorical variables now defaults to handle_unknown="ignore" instead of handle_unknown="error" (for sklearn >= 1.0.0). This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. :pr:`473` by :user:`Leo Grinsztajn <LeoGrin>`
- The :class:`MinHashEncoder` now considers None and empty strings as missing values, rather than raising an error. :pr:`378` by :user:`Gael Varoquaux <GaelVaroquaux>`
New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the :class:`TableVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`
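As a rough picture of what splitting a datetime into numerical columns means, the following standard-library sketch extracts the components named above from a single value. It is not the :class:`DatetimeEncoder` code, and the helper name ``datetime_features`` is hypothetical.

```python
from datetime import datetime


def datetime_features(value):
    """Split one datetime into the numerical components listed above."""
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "hour": value.hour,
        "minute": value.minute,
        "second": value.second,
    }


print(datetime_features(datetime(2022, 5, 17, 14, 30, 5)))
```

Each key would become one numerical column once the function is applied to every row of a datetime column.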
The :class:`TableVectorizer` has seen some major improvements and bug fixes:
- Fixes the automatic casting logic in ``transform``.
- To avoid dimensionality explosion when a feature has two unique values, the default encoder (:class:`~sklearn.preprocessing.OneHotEncoder`) now drops one of the two vectors (see parameter drop="if_binary").
- ``fit_transform`` and ``transform`` can now return unencoded features, like the :class:`~sklearn.compose.ColumnTransformer`'s behavior. Previously, a ``RuntimeError`` was raised.
Backward-incompatible change in the TableVectorizer: To apply ``remainder`` to features (with the ``*_transformer`` parameters), the value ``'remainder'`` must be passed, instead of ``None`` in previous versions. ``None`` now indicates that we want to use the default transformer. :pr:`303` by :user:`Lilian Boulard <LilianBoulard>`
Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. :pr:`289` by :user:`Lilian Boulard <LilianBoulard>`
Bumped minimum dependencies:
- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0 :pr:`299` and :pr:`300` by :user:`Lilian Boulard <LilianBoulard>`
Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
- The :class:`SimilarityEncoder` now exclusively uses ``ngram`` for similarities, and the similarity parameter is deprecated. It will be removed in 0.5. :pr:`282` by :user:`Lilian Boulard <LilianBoulard>`
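To give an intuition for an n-gram similarity between two strings, here is a minimal sketch based on the Jaccard index of character 3-gram sets. It is an illustration under stated assumptions, not the exact formula used by the :class:`SimilarityEncoder`; the helper names ``char_ngrams`` and ``ngram_similarity`` and the space padding are choices made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a string padded with spaces."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard index of the two n-gram sets: shared n-grams over all n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings too short to produce any n-gram are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(ngram_similarity("London", "London"))
print(ngram_similarity("London", "Londn"))
print(ngram_similarity("abc", "xyz"))
```

Identical strings score 1, strings sharing no n-gram score 0, and misspellings fall in between.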
- The ``transformers_`` attribute of the :class:`TableVectorizer` now contains column names instead of column indices for the "remainder" columns. :pr:`266` by :user:`Leo Grinsztajn <LeoGrin>`
- Fixed a bug in the :class:`TableVectorizer` causing a :class:`FutureWarning` when using the :func:`get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`
Improvements to the :class:`TableVectorizer`
- Type detection works better: it handles dates, numeric columns encoded as strings, and numeric columns containing strings for missing values.
:func:`get_feature_names` becomes :func:`get_feature_names_out`, following changes in the scikit-learn API. :func:`get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`
- Improvements to the :class:`MinHashEncoder`
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`. This is useful, for instance, when applying :func:`~sklearn.compose.make_column_transformer` to multiple columns.
- Fixed a bug that resulted in the :class:`GapEncoder` ignoring the analyzer argument. :pr:`242` by :user:`Jovan Stojanovic <jovan-stojanovic>`
- :class:`GapEncoder`'s get_feature_names_out now accepts all iterators, not just lists. :pr:`255` by :user:`Lilian Boulard <LilianBoulard>`
- Fixed :class:`DeprecationWarning` raised by the usage of distutils.version.LooseVersion. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`
- Remove trailing imports in the :class:`MinHashEncoder`.
- Fix typos and update links for website.
- Documentation of the :class:`TableVectorizer` and the :class:`SimilarityEncoder` improved.
Also see pre-release 0.2.0a1 below for additional changes.
Bump minimum dependencies:
- scikit-learn (>=0.21.0) :pr:`202` by :user:`Lilian Boulard <LilianBoulard>`
- pandas (>=1.1.5) ! NEW REQUIREMENT ! :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`
datasets.fetching - backward-incompatible changes to the example datasets fetchers:
- The backend has changed: we now exclusively fetch the datasets from OpenML. End users should not see any difference regarding this.
- The frontend, however, changed a little: the fetching functions stay the same but their return values were modified in favor of a more Pythonic interface. Refer to the docstrings of functions dirty_cat.datasets.fetch_* for more information.
- The example notebooks were updated to reflect these changes. :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`
Backward incompatible change to :class:`MinHashEncoder`: The :class:`MinHashEncoder` now only supports two dimensional inputs of shape (N_samples, 1). :pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.
Update handle_missing parameters:
- :class:`GapEncoder`: the default value "zero_impute" becomes "empty_impute" (see doc).
- :class:`MinHashEncoder`: the default value "" becomes "zero_impute" (see doc).
Add a get_feature_names_out method to the :class:`GapEncoder` and the :class:`TableVectorizer`, since get_feature_names will be removed in scikit-learn 1.2. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`
Removed hard-coded CSV file dirty_cat/data/FiveThirtyEight_Midwest_Survey.csv.
Improvements to the :class:`TableVectorizer`
- Missing values are not systematically imputed anymore
- Type casting and per-column imputation are now learnt during fitting
- Several bugfixes
Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using::

    pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository::

    pip install git+https://github.com/dirty-cat/dirty_cat.git

- Bump minimum dependencies:
- Python (>= 3.6)
- NumPy (>= 1.16)
- SciPy (>= 1.2)
- scikit-learn (>= 0.20.0)
- :class:`TableVectorizer`: Added automatic transform through the :class:`TableVectorizer` class. It transforms columns automatically based on their type. It provides a replacement for scikit-learn's :class:`~sklearn.compose.ColumnTransformer` that is simpler to use on heterogeneous pandas DataFrames. :pr:`167` by :user:`Lilian Boulard <LilianBoulard>`
- Backward incompatible change to :class:`GapEncoder`: The :class:`GapEncoder` now only supports two-dimensional inputs of shape (n_samples, n_features). Internally, features are encoded by independent :class:`GapEncoder` models, and are then concatenated into a single matrix. :pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.
- Fix get_feature_names for scikit-learn > 0.21. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`
- RuntimeWarnings due to overflow in :class:`GapEncoder`. :pr:`161` by :user:`Alexis Cvetkov <alexis-cvetkov>`
- :class:`GapEncoder`: Added online Gamma-Poisson factorization through the :class:`GapEncoder` class. This method discovers latent categories formed via combinations of substrings, and encodes string data as combinations of these categories. To be used if interpretability is important. :pr:`153` by :user:`Alexis Cvetkov <alexis-cvetkov>`
- Multiprocessing exception in notebook. :pr:`154` by :user:`Lilian Boulard <LilianBoulard>`
- MinHashEncoder: Added the ``minhash_encoder.py`` and ``fast_hash.py`` files, which implement minhash encoding through the :class:`MinHashEncoder` class. This method allows for fast and scalable encoding of string categorical variables.
- datasets.fetch_employee_salaries: change the origin of download for employee_salaries.
- The function now returns a bunch with a dataframe under the field "data", and not the path to the CSV file.
- The field "description" has been renamed to "DESCR".
- SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a
  similarity metric. Our implementation now accurately reproduces the behaviour
  of the ``python-Levenshtein`` implementation.
- SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
- TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
- MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
- SimilarityEncoder: Accelerate ``SimilarityEncoder.transform`` by:
  - computing the vocabulary count vectors in ``fit`` instead of ``transform``
  - computing the similarities in parallel using ``joblib``. This option can be turned on/off via the ``n_jobs`` attribute of the :class:`SimilarityEncoder`.
- SimilarityEncoder: Fix a bug that was preventing a :class:`SimilarityEncoder`
  from being created when ``categories`` was a list.
- SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
- SimilarityEncoder: Change the default ngram range to (2, 4) which performs better empirically.
- SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
- SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
- SimilarityEncoder: Performance improvements in the ngram similarity.
- SimilarityEncoder: Expose a get_feature_names method.