kedro-starters-sklearn

This repository provides the following starter templates for Kedro 0.18.14.

  • sklearn-iris trains a Logistic Regression model using Scikit-learn.
  • sklearn-mlflow-iris adds an experiment tracking feature using MLflow.

Pipeline visualized by Kedro-viz

sklearn-iris template

Iris dataset

The Iris dataset is included and used by default.

  • Modification: setosa is encoded as 0, versicolor is encoded as 1, and virginica samples are removed.
  • Split: for each remaining species, the first 25 samples are included in train.csv, and the last 25 samples are included in test.csv.
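The modification and split described above can be sketched in plain Python. This is an illustrative reconstruction, not the script used to build the bundled CSV files. The function name `prepare_iris`, the `species` column name, and the assumption that rows arrive in their original order are all hypothetical.

```python
from collections import defaultdict

# Label encoding taken from the description above;
# virginica has no entry, so its rows are dropped.
SPECIES_TO_LABEL = {"setosa": 0, "versicolor": 1}


def prepare_iris(rows, n_train=25, n_test=25):
    """Encode species as 0/1, drop virginica, and split per species.

    `rows` is a list of dicts with a "species" key (an assumed column name).
    For each kept species, the first `n_train` rows go to train and the
    last `n_test` rows go to test.
    """
    by_label = defaultdict(list)
    for row in rows:
        label = SPECIES_TO_LABEL.get(row["species"])
        if label is None:
            continue  # virginica (or any unknown species) is removed
        by_label[label].append({**row, "species": label})

    train, test = [], []
    for label in sorted(by_label):
        samples = by_label[label]
        train.extend(samples[:n_train])
        test.extend(samples[max(0, len(samples) - n_test):])
    return train, test
```

With 50 samples per species, this yields 50 training rows and 50 test rows, each evenly split between the two remaining classes.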

How to use

  1. Install dependencies.

    pip install 'kedro==0.18.14' pandas scikit-learn 
  2. Generate your Kedro starter project from the sklearn-iris directory.

    kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-iris

    As explained in Kedro's documentation, enter project_name, repo_name, and python_package.

    Note: For your Python package name, choose a unique name and avoid generic names such as "test" or "sklearn" that are already used by another package. You can see the list of importable packages by running python -c "help('modules')".

  3. Change the current directory to the generated project directory.

    cd /path/to/project/directory
  4. Run the project.

    kedro run
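The naming note in step 2 can also be checked programmatically before running kedro new. The following is a minimal sketch using only the standard library; the helper name `is_package_name_available` is hypothetical, and the check only covers the Python environment in which it is run.

```python
import importlib.util
import keyword
import sys


def is_package_name_available(name: str) -> bool:
    """Return True if `name` is a valid identifier that no importable module uses."""
    if not name.isidentifier() or keyword.iskeyword(name):
        return False
    if name in sys.modules:
        return False
    try:
        return importlib.util.find_spec(name) is None
    except (ImportError, ValueError):
        # find_spec can raise for modules with a broken or missing spec;
        # treat those names as taken to stay on the safe side.
        return False
```

For example, `is_package_name_available("sklearn")` would be expected to return `False` in an environment where scikit-learn is installed, which signals that a different package name should be chosen.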

Option to use Kaggle Titanic dataset

  1. Download the Kaggle Titanic dataset.
  2. Replace train.csv and test.csv in the /path/to/project/directory/data/01_raw directory.
  3. Modify /path/to/project/directory/base/parameters.yml to set parameters appropriate for the dataset (these are commented out by default).

sklearn-mlflow-iris template

This template integrates MLflow into Kedro using PipelineX. Even without writing MLflow code, you can:

  • configure MLflow Tracking.
  • log inputs and outputs of Python functions set up as Kedro nodes as parameters (e.g. features used to train the model) and metrics (e.g. F1 score).
  • log execution time for each Kedro node and DataSet loading/saving as metrics.
  • log artifacts (e.g. models, execution time Gantt chart visualized by Plotly, parameters.yml file).

In this template, MLflow logging is configured in Python code at src/<python_package>/mlflow/mlflow_config.py.

See here for details.
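To give a sense of what logging execution time per node involves, here is a minimal standard-library sketch of the idea. It is not PipelineX's implementation and does not call MLflow. The names `timed`, `metrics`, and `train_model`, as well as the `_time_` key prefix, are hypothetical; a real setup would forward the recorded values to MLflow as metrics.

```python
import time
from functools import wraps

# Hypothetical in-memory store standing in for an experiment tracker.
metrics = {}


def timed(func):
    """Record the wall-clock duration of each call under '_time_<name>'."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Stored even if the wrapped function raises.
            metrics[f"_time_{func.__name__}"] = time.perf_counter() - start
    return wrapper


@timed
def train_model(n):
    # Placeholder workload standing in for a Kedro node function.
    return sum(i * i for i in range(n))
```

Calling `train_model(1000)` returns the function's normal result and leaves a duration in seconds under `metrics["_time_train_model"]`. Wrapping the function keeps the timing logic out of the node code, which is the same separation the template aims for.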

How to use

  1. Install dependencies.

    pip install 'kedro==0.18.14' pandas scikit-learn mlflow 'pipelinex>=0.7.7' plotly
  2. Generate your Kedro starter project from the sklearn-mlflow-iris directory.

    kedro new --starter https://github.com/Minyus/kedro-starters-sklearn.git --directory sklearn-mlflow-iris
  3. Follow the remaining steps of the sklearn-iris template (enter the project details, change to the project directory, and run kedro run).

Access MLflow web UI

To access the MLflow web UI, launch the MLflow server.

mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri sqlite:///mlruns/sqlite.db --default-artifact-root ./mlruns
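Before opening the UI in a browser, it can be useful to confirm that something is listening on the host and port passed to the command above. This is a small standard-library sketch; the helper name `is_port_open` is hypothetical, and it only checks that a TCP connection succeeds, not that the listener is actually MLflow.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 8080, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If `is_port_open()` returns `True` after the server has started, the UI should be reachable at http://127.0.0.1:8080.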

Logged metrics shown in MLflow's UI

Gantt chart for execution time, generated using Plotly, shown in MLflow's UI