Improvements features task (#25) (#26)
Add DatasetCalculationTask to wrap featuretools calculate_feature_matrix.
Add new examples.
Update documentation.
Use categorical columns in pandas to avoid ft.vtypes.
echatzikyriakidis authored Jan 12, 2021
1 parent ed6b626 commit 8d3dd6b
Showing 91 changed files with 1,551 additions and 1,451 deletions.
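
The gist of the dtype change called out in the commit message, before the per-file diffs: the old examples built a featuretools variable_types mapping by hand and passed it inside each entity tuple, while the new examples declare categorical columns once at load time via pandas dtypes. A minimal sketch distilled from the how_do_i_use_it.rst diff below — the column names, CSV path, and id_column value are illustrative, since the example defines them in lines elided from this diff:

```python
import pandas as pd
import featuretools as ft

# Illustrative names; the real example defines these in lines not shown below.
id_column = 'id'
numerical_columns = ['num_a', 'num_b']
categorical_columns = ['cat_a', 'cat_b']
columns_subset = numerical_columns + categorical_columns

# Before: variable types spelled out and passed inside the entity tuple.
variable_types = { c : ft.variable_types.Numeric for c in numerical_columns }
variable_types.update({ c : ft.variable_types.Categorical for c in categorical_columns })
raw_data_set = pd.read_csv('data.csv', usecols=[id_column] + columns_subset)
entities = { "passengers" : (raw_data_set, id_column, None, variable_types) }

# After: categoricals declared once via pandas dtypes; featuretools infers the
# variable types from them, so the entity tuple shrinks to two items.
raw_data_set = pd.read_csv('data.csv', usecols=[id_column, *columns_subset],
                           dtype={ c: 'category' for c in categorical_columns })
entities = { "passengers" : (raw_data_set, id_column) }
```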
6 changes: 3 additions & 3 deletions README.md
@@ -1,7 +1,7 @@
[![Python](https://img.shields.io/badge/python-3.6%20%7C%203.7%20%7C%203.8-blue?style=plastic)](https://www.python.org/)
-[![PyPI](https://img.shields.io/badge/pypi_package-1.0.12-blue?style=plastic)](https://pypi.org/project/skrobot/1.0.12/)
+[![PyPI](https://img.shields.io/badge/pypi_package-1.0.13-blue?style=plastic)](https://pypi.org/project/skrobot/1.0.13/)
[![License](https://img.shields.io/badge/license-MIT-blue?style=plastic)](https://github.com/medoidai/skrobot/blob/master/LICENSE.txt)
-[![Documentation Status](https://readthedocs.org/projects/skrobot/badge/?version=1.0.12)](https://skrobot.readthedocs.io/en/1.0.12/)
+[![Documentation Status](https://readthedocs.org/projects/skrobot/badge/?version=1.0.13)](https://skrobot.readthedocs.io/en/1.0.13/)

-----------------

@@ -17,7 +17,7 @@ skrobot is a Python module for designing, running and tracking Machine Learning

## Documentation?

-The documentation is hosted online to [Read the Docs](https://skrobot.readthedocs.io/en/1.0.12/).
+The documentation is hosted online to [Read the Docs](https://skrobot.readthedocs.io/en/1.0.13/).

## How do I install it?

5 changes: 5 additions & 0 deletions RELEASE.md
@@ -1,3 +1,8 @@
+# 1.0.13
+
+* Add DatasetCalculationTask
+* Update examples using feature synthesis
+
# 1.0.12

* Fix problem of feature graph filenames (in case of Windows)
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -21,7 +21,7 @@
author = 'Medoid AI'

# The full version, including alpha/beta/rc tags
-release = '1.0.12'
+release = '1.0.13'


# -- General configuration ---------------------------------------------------
110 changes: 52 additions & 58 deletions docs/source/how_do_i_use_it.rst
@@ -1,18 +1,18 @@
How do I use it?
================

-The following examples use many of skrobot’s components to built a machine learning modelling pipeline. Please try them and we would love to have your feedback! Furthermore, many examples can be found in the project's `repository <https://github.com/medoidai/skrobot/tree/1.0.12/examples>`__.
+The following examples use many of skrobot’s components to built a machine learning modelling pipeline. Please try them and we would love to have your feedback! Furthermore, many examples can be found in the project's `repository <https://github.com/medoidai/skrobot/tree/1.0.13/examples>`__.

Example on Titanic Dataset
--------------------------

-The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.12/examples/experiments-output/example-titanic-pipeline-with-model-based-feature-selection>`__.
+The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.13/examples/experiments-output/example-titanic-pipeline-with-model-based-feature-selection>`__.

.. code:: python
import os
import pandas as pd
import featuretools as ft
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
@@ -28,6 +28,7 @@ The following example has generated the following `results <https://github.com/m
from skrobot.tasks import EvaluationCrossValidationTask
from skrobot.tasks import HyperParametersSearchCrossValidationTask
from skrobot.tasks import DeepFeatureSynthesisTask
+from skrobot.tasks import DatasetCalculationTask
from skrobot.feature_selection import ColumnSelector
from skrobot.notification import EmailNotifier
@@ -43,9 +44,9 @@ The following example has generated the following `results <https://github.com/m
columns_subset = numerical_columns + categorical_columns
-raw_data_set = pd.read_csv('https://bit.ly/titanic-data-set', usecols=[id_column, label_column] + columns_subset)
+raw_data_set = pd.read_csv('https://bit.ly/titanic-data-set', usecols=[id_column, label_column, *columns_subset], dtype={ c: 'category' for c in categorical_columns })
-new_raw_data_set = pd.read_csv('https://bit.ly/titanic-data-new', usecols=[id_column] + columns_subset)
+new_raw_data_set = pd.read_csv('https://bit.ly/titanic-data-new', usecols=[id_column, *columns_subset], dtype={ c: 'category' for c in categorical_columns })
random_seed = 42
@@ -67,31 +68,26 @@ The following example has generated the following `results <https://github.com/m
"preprocessor__numerical_transformer__imputer__strategy" : [ "mean", "median" ]
}
-variable_types = { c : ft.variable_types.Numeric for c in numerical_columns }
-variable_types.update({ c : ft.variable_types.Categorical for c in categorical_columns })
######### skrobot Code
# Create a Notifier
notifier = EmailNotifier(email_subject="skrobot notification",
-                                         sender_account=os.environ['EMAIL_SENDER_ACCOUNT'],
-                                         sender_password=os.environ['EMAIL_SENDER_PASSWORD'],
-                                         smtp_server=os.environ['EMAIL_SMTP_SERVER'],
-                                         smtp_port=os.environ['EMAIL_SMTP_PORT'],
-                                         recipients=os.environ['EMAIL_RECIPIENTS'])
+                         sender_account=os.environ['EMAIL_SENDER_ACCOUNT'],
+                         sender_password=os.environ['EMAIL_SENDER_PASSWORD'],
+                         smtp_server=os.environ['EMAIL_SMTP_SERVER'],
+                         smtp_port=os.environ['EMAIL_SMTP_PORT'],
+                         recipients=os.environ['EMAIL_RECIPIENTS'])
# Build an Experiment
experiment = Experiment('experiments-output').set_source_code_file_path(__file__).set_experimenter('echatzikyriakidis').set_notifier(notifier).build()
# Run Deep Feature Synthesis Task
-feature_synthesis_results = experiment.run(DeepFeatureSynthesisTask (entities={ "passengers" : (raw_data_set, id_column, None, variable_types) },
-                                           target_entity="passengers",
-                                           trans_primitives = ['add_numeric', 'multiply_numeric'],
-                                           export_feature_information=True,
-                                           export_feature_graphs=True,
-                                           id_column=id_column,
-                                           label_column=label_column))
+feature_synthesis_results = experiment.run(DeepFeatureSynthesisTask (entities={ "passengers" : (raw_data_set, id_column) },
+                                           target_entity="passengers",
+                                           trans_primitives = ['add_numeric', 'multiply_numeric'],
+                                           export_feature_information=True,
+                                           export_feature_graphs=True,
+                                           label_column=label_column))
data_set = feature_synthesis_results['synthesized_dataset']
@@ -105,58 +101,56 @@ The following example has generated the following `results <https://github.com/m
preprocessor = ColumnTransformer(transformers=[
('numerical_transformer', numeric_transformer, numerical_features),
-    ('categorical_transformer', categorical_transformer, categorical_features)
-])
+    ('categorical_transformer', categorical_transformer, categorical_features)])
# Run Feature Selection Task
features_columns = experiment.run(FeatureSelectionCrossValidationTask (estimator=classifier,
-                                     train_data_set=train_data_set,
-                                     preprocessor=preprocessor,
-                                     id_column=id_column,
-                                     label_column=label_column,
-                                     random_seed=random_seed).stratified_folds(total_folds=5, shuffle=True))
+                train_data_set=train_data_set,
+                preprocessor=preprocessor,
+                id_column=id_column,
+                label_column=label_column,
+                random_seed=random_seed).stratified_folds(total_folds=5, shuffle=True))
pipe = Pipeline(steps=[('preprocessor', preprocessor),
-                ('selector', ColumnSelector(cols=features_columns)),
-                ('classifier', classifier)])
+       ('selector', ColumnSelector(cols=features_columns)),
+       ('classifier', classifier)])
# Run Hyperparameters Search Task
hyperparameters_search_results = experiment.run(HyperParametersSearchCrossValidationTask (estimator=pipe,
-                                                 search_params=search_params,
-                                                 train_data_set=train_data_set,
-                                                 id_column=id_column,
-                                                 label_column=label_column,
-                                                 random_seed=random_seed).random_search(n_iters=100).stratified_folds(total_folds=5, shuffle=True))
+                search_params=search_params,
+                train_data_set=train_data_set,
+                id_column=id_column,
+                label_column=label_column,
+                random_seed=random_seed).random_search(n_iters=100).stratified_folds(total_folds=5, shuffle=True))
# Run Evaluation Task
evaluation_results = experiment.run(EvaluationCrossValidationTask(estimator=pipe,
-                                            estimator_params=hyperparameters_search_results['best_params'],
-                                            train_data_set=train_data_set,
-                                            test_data_set=test_data_set,
-                                            id_column=id_column,
-                                            label_column=label_column,
-                                            random_seed=random_seed,
-                                            export_classification_reports=True,
-                                            export_confusion_matrixes=True,
-                                            export_pr_curves=True,
-                                            export_roc_curves=True,
-                                            export_false_positives_reports=True,
-                                            export_false_negatives_reports=True,
-                                            export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))
+                estimator_params=hyperparameters_search_results['best_params'],
+                train_data_set=train_data_set,
+                test_data_set=test_data_set,
+                id_column=id_column,
+                label_column=label_column,
+                random_seed=random_seed,
+                export_classification_reports=True,
+                export_confusion_matrixes=True,
+                export_pr_curves=True,
+                export_roc_curves=True,
+                export_false_positives_reports=True,
+                export_false_negatives_reports=True,
+                export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))
# Run Train Task
train_results = experiment.run(TrainTask(estimator=pipe,
-                   estimator_params=hyperparameters_search_results['best_params'],
-                   train_data_set=train_data_set,
-                   id_column=id_column,
-                   label_column=label_column,
-                   random_seed=random_seed))
-# Run Prediction Task
-new_data_set = ft.calculate_feature_matrix(feature_defs, entities={ "passengers" : (new_raw_data_set, id_column, None, variable_types) })
+        estimator_params=hyperparameters_search_results['best_params'],
+        train_data_set=train_data_set,
+        id_column=id_column,
+        label_column=label_column,
+        random_seed=random_seed))
-new_data_set.reset_index(inplace=True)
+# Run Dataset Calculation Task
+new_data_set = experiment.run(DatasetCalculationTask(feature_defs, entities={ "passengers" : (new_raw_data_set, id_column) }))
+# Run Prediction Task
predictions = experiment.run(PredictionTask(estimator=train_results['estimator'],
data_set=new_data_set,
id_column=id_column,
@@ -188,7 +182,7 @@ The following example has generated the following `results <https://github.com/m
Example on SMS Spam Collection Dataset
--------------------------------------

-The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.12/examples/experiments-output/example-sms-spam-ham-pipeline-with-filtering-feature-selection>`__.
+The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.13/examples/experiments-output/example-sms-spam-ham-pipeline-with-filtering-feature-selection>`__.

.. code:: python
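
To make the headline DatasetCalculationTask change concrete, here are the removed and added lines from the Titanic example side by side. A sketch assembled from the diff above, not fresh API documentation; it assumes, as the dropped reset_index() call suggests, that the task resets the index internally:

```python
# Before: featuretools called directly, outside the experiment's tracking,
# with a manual index reset afterwards.
new_data_set = ft.calculate_feature_matrix(feature_defs,
    entities={ "passengers" : (new_raw_data_set, id_column, None, variable_types) })
new_data_set.reset_index(inplace=True)

# After: the same calculation runs as a tracked skrobot task; the entity tuple
# no longer carries variable_types, and no manual reset_index() is needed.
new_data_set = experiment.run(DatasetCalculationTask(feature_defs,
    entities={ "passengers" : (new_raw_data_set, id_column) }))
```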
8 changes: 4 additions & 4 deletions docs/source/index.rst
@@ -21,11 +21,11 @@ Welcome to skrobot's documentation!

.. |Python| image:: https://img.shields.io/badge/python-3.6%20%7C%203.7%20%7C%203.8-blue?style=plastic
:target: https://www.python.org/
-.. |PyPI| image:: https://img.shields.io/badge/pypi_package-1.0.12-blue?style=plastic
-   :target: https://pypi.org/project/skrobot/1.0.12/
+.. |PyPI| image:: https://img.shields.io/badge/pypi_package-1.0.13-blue?style=plastic
+   :target: https://pypi.org/project/skrobot/1.0.13/
.. |License| image:: https://img.shields.io/badge/license-MIT-blue?style=plastic
:target: https://github.com/medoidai/skrobot/blob/master/LICENSE.txt
-.. |Documentation Status| image:: https://readthedocs.org/projects/skrobot/badge/?version=1.0.12
-   :target: https://skrobot.readthedocs.io/en/1.0.12/
+.. |Documentation Status| image:: https://readthedocs.org/projects/skrobot/badge/?version=1.0.13
+   :target: https://skrobot.readthedocs.io/en/1.0.13/
.. |skrobot logo| image:: https://github.com/medoidai/skrobot/raw/master/static/skrobot-logo.png
:target: https://github.com/medoidai/skrobot/raw/master/static/skrobot-logo.png