Improvements features task (#25) (#26)
Add DatasetCalculationTask to wrap featuretools calculate_feature_matrix.
Add new examples.
Update documentation.
Use categorical columns in pandas to avoid ft.vtypes.
echatzikyriakidis authored Jan 12, 2021
1 parent ed6b626 commit 8d3dd6b
Showing 91 changed files with 1,551 additions and 1,451 deletions.
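
The gist of the dtype change called out in the commit message, before the per-file diffs: the old examples built a featuretools variable_types mapping by hand and passed it inside each entity tuple, while the new examples declare categorical columns once at load time via pandas dtypes. A minimal sketch distilled from the how_do_i_use_it.rst diff below — the column names, CSV path, and id_column value are illustrative, since the example defines them in lines elided from this diff:

```python
import pandas as pd
import featuretools as ft

# Illustrative names; the real example defines these in lines not shown below.
id_column = 'id'
numerical_columns = ['num_a', 'num_b']
categorical_columns = ['cat_a', 'cat_b']
columns_subset = numerical_columns + categorical_columns

# Before: variable types spelled out and passed inside the entity tuple.
variable_types = { c : ft.variable_types.Numeric for c in numerical_columns }
variable_types.update({ c : ft.variable_types.Categorical for c in categorical_columns })
raw_data_set = pd.read_csv('data.csv', usecols=[id_column] + columns_subset)
entities = { "passengers" : (raw_data_set, id_column, None, variable_types) }

# After: categoricals declared once via pandas dtypes; featuretools infers the
# variable types from them, so the entity tuple shrinks to two items.
raw_data_set = pd.read_csv('data.csv', usecols=[id_column, *columns_subset],
                           dtype={ c: 'category' for c in categorical_columns })
entities = { "passengers" : (raw_data_set, id_column) }
```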
6 changes: 3 additions & 3 deletions README.md
@@ -1,7 +1,7 @@
[![Python](https://img.shields.io/badge/python-3.6%20%7C%203.7%20%7C%203.8-blue?style=plastic)](https://www.python.org/)
-[![PyPI](https://img.shields.io/badge/pypi_package-1.0.12-blue?style=plastic)](https://pypi.org/project/skrobot/1.0.12/)
+[![PyPI](https://img.shields.io/badge/pypi_package-1.0.13-blue?style=plastic)](https://pypi.org/project/skrobot/1.0.13/)
[![License](https://img.shields.io/badge/license-MIT-blue?style=plastic)](https://github.com/medoidai/skrobot/blob/master/LICENSE.txt)
-[![Documentation Status](https://readthedocs.org/projects/skrobot/badge/?version=1.0.12)](https://skrobot.readthedocs.io/en/1.0.12/)
+[![Documentation Status](https://readthedocs.org/projects/skrobot/badge/?version=1.0.13)](https://skrobot.readthedocs.io/en/1.0.13/)

-----------------

@@ -17,7 +17,7 @@ skrobot is a Python module for designing, running and tracking Machine Learning

## Documentation?

-The documentation is hosted online to [Read the Docs](https://skrobot.readthedocs.io/en/1.0.12/).
+The documentation is hosted online to [Read the Docs](https://skrobot.readthedocs.io/en/1.0.13/).

## How do I install it?

5 changes: 5 additions & 0 deletions RELEASE.md
@@ -1,3 +1,8 @@
+# 1.0.13
+
+* Add DatasetCalculationTask
+* Update examples using feature synthesis
+
# 1.0.12

* Fix problem of feature graph filenames (in case of Windows)
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -21,7 +21,7 @@
author = 'Medoid AI'

# The full version, including alpha/beta/rc tags
-release = '1.0.12'
+release = '1.0.13'


# -- General configuration ---------------------------------------------------
110 changes: 52 additions & 58 deletions docs/source/how_do_i_use_it.rst
@@ -1,18 +1,18 @@
How do I use it?
================

-The following examples use many of skrobot’s components to built a machine learning modelling pipeline. Please try them and we would love to have your feedback! Furthermore, many examples can be found in the project's `repository <https://github.com/medoidai/skrobot/tree/1.0.12/examples>`__.
+The following examples use many of skrobot’s components to built a machine learning modelling pipeline. Please try them and we would love to have your feedback! Furthermore, many examples can be found in the project's `repository <https://github.com/medoidai/skrobot/tree/1.0.13/examples>`__.

Example on Titanic Dataset
--------------------------

-The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.12/examples/experiments-output/example-titanic-pipeline-with-model-based-feature-selection>`__.
+The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.13/examples/experiments-output/example-titanic-pipeline-with-model-based-feature-selection>`__.

.. code:: python
import os
import pandas as pd
import featuretools as ft
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
@@ -28,6 +28,7 @@ The following example has generated the following `results <https://github.com/m
from skrobot.tasks import EvaluationCrossValidationTask
from skrobot.tasks import HyperParametersSearchCrossValidationTask
from skrobot.tasks import DeepFeatureSynthesisTask
+from skrobot.tasks import DatasetCalculationTask
from skrobot.feature_selection import ColumnSelector
from skrobot.notification import EmailNotifier
@@ -43,9 +44,9 @@ The following example has generated the following `results <https://github.com/m
columns_subset = numerical_columns + categorical_columns
-raw_data_set = pd.read_csv('https://bit.ly/titanic-data-set', usecols=[id_column, label_column] + columns_subset)
+raw_data_set = pd.read_csv('https://bit.ly/titanic-data-set', usecols=[id_column, label_column, *columns_subset], dtype={ c: 'category' for c in categorical_columns })
-new_raw_data_set = pd.read_csv('https://bit.ly/titanic-data-new', usecols=[id_column] + columns_subset)
+new_raw_data_set = pd.read_csv('https://bit.ly/titanic-data-new', usecols=[id_column, *columns_subset], dtype={ c: 'category' for c in categorical_columns })
random_seed = 42
@@ -67,31 +68,26 @@ The following example has generated the following `results <https://github.com/m
"preprocessor__numerical_transformer__imputer__strategy" : [ "mean", "median" ]
}
-variable_types = { c : ft.variable_types.Numeric for c in numerical_columns }
-variable_types.update({ c : ft.variable_types.Categorical for c in categorical_columns })
######### skrobot Code
# Create a Notifier
notifier = EmailNotifier(email_subject="skrobot notification",
-                                         sender_account=os.environ['EMAIL_SENDER_ACCOUNT'],
-                                         sender_password=os.environ['EMAIL_SENDER_PASSWORD'],
-                                         smtp_server=os.environ['EMAIL_SMTP_SERVER'],
-                                         smtp_port=os.environ['EMAIL_SMTP_PORT'],
-                                         recipients=os.environ['EMAIL_RECIPIENTS'])
+                         sender_account=os.environ['EMAIL_SENDER_ACCOUNT'],
+                         sender_password=os.environ['EMAIL_SENDER_PASSWORD'],
+                         smtp_server=os.environ['EMAIL_SMTP_SERVER'],
+                         smtp_port=os.environ['EMAIL_SMTP_PORT'],
+                         recipients=os.environ['EMAIL_RECIPIENTS'])
# Build an Experiment
experiment = Experiment('experiments-output').set_source_code_file_path(__file__).set_experimenter('echatzikyriakidis').set_notifier(notifier).build()
# Run Deep Feature Synthesis Task
-feature_synthesis_results = experiment.run(DeepFeatureSynthesisTask (entities={ "passengers" : (raw_data_set, id_column, None, variable_types) },
-                                           target_entity="passengers",
-                                           trans_primitives = ['add_numeric', 'multiply_numeric'],
-                                           export_feature_information=True,
-                                           export_feature_graphs=True,
-                                           id_column=id_column,
-                                           label_column=label_column))
+feature_synthesis_results = experiment.run(DeepFeatureSynthesisTask (entities={ "passengers" : (raw_data_set, id_column) },
+                                           target_entity="passengers",
+                                           trans_primitives = ['add_numeric', 'multiply_numeric'],
+                                           export_feature_information=True,
+                                           export_feature_graphs=True,
+                                           label_column=label_column))
data_set = feature_synthesis_results['synthesized_dataset']
@@ -105,58 +101,56 @@ The following example has generated the following `results <https://github.com/m
preprocessor = ColumnTransformer(transformers=[
('numerical_transformer', numeric_transformer, numerical_features),
-    ('categorical_transformer', categorical_transformer, categorical_features)
-])
+    ('categorical_transformer', categorical_transformer, categorical_features)])
# Run Feature Selection Task
features_columns = experiment.run(FeatureSelectionCrossValidationTask (estimator=classifier,
-                                     train_data_set=train_data_set,
-                                     preprocessor=preprocessor,
-                                     id_column=id_column,
-                                     label_column=label_column,
-                                     random_seed=random_seed).stratified_folds(total_folds=5, shuffle=True))
+                train_data_set=train_data_set,
+                preprocessor=preprocessor,
+                id_column=id_column,
+                label_column=label_column,
+                random_seed=random_seed).stratified_folds(total_folds=5, shuffle=True))
pipe = Pipeline(steps=[('preprocessor', preprocessor),
-                ('selector', ColumnSelector(cols=features_columns)),
-                ('classifier', classifier)])
+       ('selector', ColumnSelector(cols=features_columns)),
+       ('classifier', classifier)])
# Run Hyperparameters Search Task
hyperparameters_search_results = experiment.run(HyperParametersSearchCrossValidationTask (estimator=pipe,
-                                                 search_params=search_params,
-                                                 train_data_set=train_data_set,
-                                                 id_column=id_column,
-                                                 label_column=label_column,
-                                                 random_seed=random_seed).random_search(n_iters=100).stratified_folds(total_folds=5, shuffle=True))
+                search_params=search_params,
+                train_data_set=train_data_set,
+                id_column=id_column,
+                label_column=label_column,
+                random_seed=random_seed).random_search(n_iters=100).stratified_folds(total_folds=5, shuffle=True))
# Run Evaluation Task
evaluation_results = experiment.run(EvaluationCrossValidationTask(estimator=pipe,
-                                            estimator_params=hyperparameters_search_results['best_params'],
-                                            train_data_set=train_data_set,
-                                            test_data_set=test_data_set,
-                                            id_column=id_column,
-                                            label_column=label_column,
-                                            random_seed=random_seed,
-                                            export_classification_reports=True,
-                                            export_confusion_matrixes=True,
-                                            export_pr_curves=True,
-                                            export_roc_curves=True,
-                                            export_false_positives_reports=True,
-                                            export_false_negatives_reports=True,
-                                            export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))
+                estimator_params=hyperparameters_search_results['best_params'],
+                train_data_set=train_data_set,
+                test_data_set=test_data_set,
+                id_column=id_column,
+                label_column=label_column,
+                random_seed=random_seed,
+                export_classification_reports=True,
+                export_confusion_matrixes=True,
+                export_pr_curves=True,
+                export_roc_curves=True,
+                export_false_positives_reports=True,
+                export_false_negatives_reports=True,
+                export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))
# Run Train Task
train_results = experiment.run(TrainTask(estimator=pipe,
-                   estimator_params=hyperparameters_search_results['best_params'],
-                   train_data_set=train_data_set,
-                   id_column=id_column,
-                   label_column=label_column,
-                   random_seed=random_seed))
-# Run Prediction Task
-new_data_set = ft.calculate_feature_matrix(feature_defs, entities={ "passengers" : (new_raw_data_set, id_column, None, variable_types) })
+        estimator_params=hyperparameters_search_results['best_params'],
+        train_data_set=train_data_set,
+        id_column=id_column,
+        label_column=label_column,
+        random_seed=random_seed))
-new_data_set.reset_index(inplace=True)
+# Run Dataset Calculation Task
+new_data_set = experiment.run(DatasetCalculationTask(feature_defs, entities={ "passengers" : (new_raw_data_set, id_column) }))
+# Run Prediction Task
predictions = experiment.run(PredictionTask(estimator=train_results['estimator'],
data_set=new_data_set,
id_column=id_column,
@@ -188,7 +182,7 @@ The following example has generated the following `results <https://github.com/m
Example on SMS Spam Collection Dataset
--------------------------------------

-The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.12/examples/experiments-output/example-sms-spam-ham-pipeline-with-filtering-feature-selection>`__.
+The following example has generated the following `results <https://github.com/medoidai/skrobot/tree/1.0.13/examples/experiments-output/example-sms-spam-ham-pipeline-with-filtering-feature-selection>`__.

.. code:: python
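
To make the headline DatasetCalculationTask change concrete, here are the removed and added lines from the Titanic example side by side. A sketch assembled from the diff above, not fresh API documentation; it assumes, as the dropped reset_index() call suggests, that the task resets the index internally:

```python
# Before: featuretools called directly, outside the experiment's tracking,
# with a manual index reset afterwards.
new_data_set = ft.calculate_feature_matrix(feature_defs,
    entities={ "passengers" : (new_raw_data_set, id_column, None, variable_types) })
new_data_set.reset_index(inplace=True)

# After: the same calculation runs as a tracked skrobot task; the entity tuple
# no longer carries variable_types, and no manual reset_index() is needed.
new_data_set = experiment.run(DatasetCalculationTask(feature_defs,
    entities={ "passengers" : (new_raw_data_set, id_column) }))
```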
8 changes: 4 additions & 4 deletions docs/source/index.rst
@@ -21,11 +21,11 @@ Welcome to skrobot's documentation!

.. |Python| image:: https://img.shields.io/badge/python-3.6%20%7C%203.7%20%7C%203.8-blue?style=plastic
:target: https://www.python.org/
-.. |PyPI| image:: https://img.shields.io/badge/pypi_package-1.0.12-blue?style=plastic
-   :target: https://pypi.org/project/skrobot/1.0.12/
+.. |PyPI| image:: https://img.shields.io/badge/pypi_package-1.0.13-blue?style=plastic
+   :target: https://pypi.org/project/skrobot/1.0.13/
.. |License| image:: https://img.shields.io/badge/license-MIT-blue?style=plastic
:target: https://github.com/medoidai/skrobot/blob/master/LICENSE.txt
-.. |Documentation Status| image:: https://readthedocs.org/projects/skrobot/badge/?version=1.0.12
-   :target: https://skrobot.readthedocs.io/en/1.0.12/
+.. |Documentation Status| image:: https://readthedocs.org/projects/skrobot/badge/?version=1.0.13
+   :target: https://skrobot.readthedocs.io/en/1.0.13/
.. |skrobot logo| image:: https://github.com/medoidai/skrobot/raw/master/static/skrobot-logo.png
:target: https://github.com/medoidai/skrobot/raw/master/static/skrobot-logo.png