This repository provides an example end-to-end machine learning pipeline on AWS, built using the SageMaker Python SDK. It leans on other resources (e.g. here and here), but provides a unified end-to-end example in a single notebook, from data processing to deployment of a REST API. This is not production-ready, but it will give you a good first intuition for how to orchestrate the ML lifecycle on AWS via the SageMaker SDK.
The main resource for this guide is the notebook `ml_pipeline.ipynb` in the folder `notebooks`. The easiest way to follow along with the tutorial is to launch a notebook instance on AWS SageMaker and pull the repository into your JupyterLab environment.
The Penguins dataset from Allison Horst is an alternative to the famous Iris dataset that can be used for demonstrating various ML tasks. Read more here.
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
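To get a feel for the data, the sample rows above can be reconstructed as a pandas DataFrame. This is just a toy sketch for illustration; in the actual notebook the full dataset would be loaded from a file or S3 rather than typed in by hand.

```python
import pandas as pd

# Toy sample matching the three rows shown above,
# not the full Penguins dataset.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "flipper_length_mm": [181, 186, 195],
    "body_mass_g": [3750, 3800, 3250],
    "sex": ["male", "female", "female"],
    "year": [2007, 2007, 2007],
})

print(penguins.head())
```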
The goal is to train a classifier that predicts the sex of a penguin based on all other available variables.
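As a rough local sketch of that classification task (outside of SageMaker, and using a tiny hand-typed sample of the data rather than the full dataset), one could one-hot encode the categorical features and fit a scikit-learn model. The actual notebook orchestrates training through the SageMaker SDK instead; this only illustrates the learning problem itself.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny illustrative sample; the real pipeline uses the full dataset.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "flipper_length_mm": [181, 186, 195],
    "body_mass_g": [3750, 3800, 3250],
    "sex": ["male", "female", "female"],
    "year": [2007, 2007, 2007],
})

# Target is sex; all other columns are features.
# One-hot encode the categorical columns (species, island).
X = pd.get_dummies(penguins.drop(columns="sex"))
y = penguins["sex"]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```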
- stored in `/notebooks`:
  - `eda.ipynb`: visual exploration of the data
  - `ml_pipeline.ipynb`: orchestrates preprocessing of the data, model training, and deployment of the model as an endpoint
- head over to `notebooks/ml_pipeline.ipynb` and follow the procedure