
ML Workflows

This repository showcases an implementation of a model training pipeline. Although the project uses Kubeflow Pipelines (KFP), Google Cloud Platform (GCP), and a weather image classification dataset for this case study, the focus is on demonstrating key concepts such as reproducibility, artifact tracking, and automation, rather than on the specifics of tools and implementation details.

The ml-workflows repository follows the standard project structure and tooling proposed in the repository py-manage.

An extensive write-up and discussion of the approaches taken here can be found in the article "ML Training Pipelines".

Table of Contents

  • Hypothetical Problem Statement
  • Pipeline Overview
  • Artifact Examples
  • Comparison Between Different Pipeline Runs
  • Prerequisites
  • Compile and Upload Pipeline Template
  • Creating Pipeline Run From Your Template
  • Porting Pipeline Template to Different Platforms
  • Appendix

Hypothetical Problem Statement

Let’s assume we are running a renewable energy company that seeks to optimize solar and wind farm operations across diverse geographic locations. By implementing an AI system that can automatically recognize weather conditions from images captured by on-site cameras, we can predict energy output more accurately and adjust operations in real-time. This weather recognition capability would enable more efficient resource allocation and improve overall energy production forecasting.

For this problem, we've acquired a “Weather Image Recognition” dataset as an initial dataset that we believe will meet our needs. Our goal is to create a model capable of predicting 11 distinct weather conditions: dew, fog/smog, frost, glaze, hail, lightning, rain, rainbow, rime, sandstorm, and snow. This diverse range of weather phenomena will allow our AI system to provide comprehensive insights for optimizing our renewable energy operations.

The aim of our project is to develop a robust model training pipeline that researchers and engineers can easily reuse with different runtime parameters. It should accommodate varying data sources (if somebody decides to enhance the initial dataset), data splits, random seeds, training epochs, etc. The pipeline should guarantee reproducibility and ease of artifact tracking, as well as a high level of automation.

Pipeline Overview

The pipeline consists of four main components (a minimal wiring sketch in the KFP DSL follows the list):

  1. Data Preparation (data_prep):
    • Splits the dataset into train, validation, and test sets.
    • Uses stratified sampling to maintain class distribution across splits.
    • Outputs: train, validation, and test split information.
  2. Model Training (train):
    • Implements reproducibility measures.
    • Uses the MobileNetV3-Small architecture with transfer learning.
    • Fine-tunes the classifier head for the specific problem domain.
    • Outputs: the trained PyTorch model, an ONNX model, an ONNX model with the transformations included as part of the model graph, training metrics, and a loss plot.
  3. Model Evaluation (eval):
    • Evaluates the model on the test set.
    • Calculates evaluation metrics.
    • Outputs: confusion matrix, weighted precision, recall, and F1-scores.
  4. ONNX Model Optimization (onnx_optimize):
    • A simple component that loads the ONNX model with transformations, optimizes it, and produces an optimized model artifact.
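
For orientation, here is a minimal, hypothetical sketch of how such components could be wired together with the KFP v2 DSL. The image URIs, component signatures, and parameter names (apart from data_bucket) are illustrative assumptions, not the repository's actual code; eval_model is used instead of eval to avoid shadowing the Python built-in.

from kfp import dsl


@dsl.container_component
def data_prep(data_bucket: str, random_seed: int, splits: dsl.Output[dsl.Artifact]):
    # Each component runs its own Docker image; the URI below is a placeholder.
    return dsl.ContainerSpec(
        image="PREP_DATA_DOCKER_URI",
        command=["python", "-m", "data_prep"],
        args=[data_bucket, random_seed, splits.path],
    )


@dsl.container_component
def train(splits: dsl.Input[dsl.Artifact], random_seed: int, epochs: int, model: dsl.Output[dsl.Model]):
    # The real component also emits ONNX artifacts, metrics, and a loss plot;
    # a single model output keeps this sketch short.
    return dsl.ContainerSpec(
        image="TRAIN_MODEL_DOCKER_URI",
        command=["python", "-m", "train"],
        args=[splits.path, random_seed, epochs, model.path],
    )


@dsl.container_component
def eval_model(splits: dsl.Input[dsl.Artifact], model: dsl.Input[dsl.Model], metrics: dsl.Output[dsl.Metrics]):
    return dsl.ContainerSpec(
        image="EVAL_MODEL_DOCKER_URI",
        command=["python", "-m", "eval"],
        args=[splits.path, model.path, metrics.path],
    )


@dsl.container_component
def onnx_optimize(model: dsl.Input[dsl.Model], optimized: dsl.Output[dsl.Model]):
    return dsl.ContainerSpec(
        image="ONNX_OPTIMIZE_DOCKER_URI",
        command=["python", "-m", "onnx_optimize"],
        args=[model.path, optimized.path],
    )


@dsl.pipeline(name="ml-workflows-training")
def training_pipeline(data_bucket: str = "weather_imgs", random_seed: int = 42, epochs: int = 10):
    prep = data_prep(data_bucket=data_bucket, random_seed=random_seed)
    trained = train(splits=prep.outputs["splits"], random_seed=random_seed, epochs=epochs)
    eval_model(splits=prep.outputs["splits"], model=trained.outputs["model"])
    onnx_optimize(model=trained.outputs["model"])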

Each component has its own Docker image, ensuring a reproducible runtime environment through the use of Poetry and its .lock files. To guarantee consistency in results, all components employ fixed random seeds for randomized algorithms. Additionally, the training component includes extra PyTorch reproducibility measures.
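
As a concrete illustration, such seed-fixing and determinism measures typically look like the sketch below (assuming torch and numpy; the repository's exact measures may differ):

import os
import random

import numpy as np
import torch


def fix_seeds(seed: int) -> None:
    # Seed every source of randomness the components rely on.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices
    # Extra PyTorch measures: insist on deterministic kernels.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required for deterministic cuBLAS on GPU
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False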

By default, component result caching is enabled, meaning components are not re-executed unless their input parameters or the components themselves are modified. For example, this would allow for adjustments to the evaluation code without the need for time-consuming retraining.
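
Caching can also be controlled explicitly per task. Building on the hypothetical pipeline sketch above, a single component can opt out, for example to always re-run evaluation:

from kfp import dsl


@dsl.pipeline(name="ml-workflows-training-fresh-eval")
def training_pipeline_fresh_eval(data_bucket: str = "weather_imgs", random_seed: int = 42, epochs: int = 10):
    prep = data_prep(data_bucket=data_bucket, random_seed=random_seed)
    trained = train(splits=prep.outputs["splits"], random_seed=random_seed, epochs=epochs)
    eval_task = eval_model(splits=prep.outputs["splits"], model=trained.outputs["model"])
    eval_task.set_caching_options(False)  # re-execute even when inputs are unchanged
    onnx_optimize(model=trained.outputs["model"])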

Artifact Examples

Confusion Matrix

Evaluation Metrics

Comparison Between Different Pipeline Runs

If we add pipeline runs of interest to a Vertex AI Experiment, we get side-by-side comparisons for free.

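A hedged sketch of attaching a run to an experiment at submission time with google-cloud-aiplatform (project, bucket, and experiment names are illustrative):

from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west1")
job = aiplatform.PipelineJob(
    display_name="weather-training-run",
    template_path="pipeline.yaml",  # or an Artifact Registry template URL
    pipeline_root="gs://my-staging-bucket",
)
job.submit(experiment="weather-training-experiments")  # groups this run with others for comparison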

Prerequisites

The implementation of this pipeline template and its runs is coupled to GCP. To compile and upload the pipeline template, as well as create a pipeline run from it, one has to:

  • Have a GCP project.
  • Enable Vertex AI API.
  • Have a GCS bucket with the "Weather Image Recognition" dataset uploaded. In code, this is the data_bucket parameter of the pipeline, which defaults to "weather_imgs"; use your bucket name here.
  • Have a GCS staging bucket, where Kubeflow Pipelines can persist its artifacts.
  • Have a Docker type repository in Artifact Registry, where built images of components are pushed.
  • Have a Kubeflow Pipelines type repository in Artifact Registry, where templates of pipelines are pushed.
  • Have the gcloud CLI installed.
  • Correctly configure/authorize gcloud:
gcloud config set project $GCP_PROJECT
gcloud auth application-default login
  • Have a local .env file with the following env variables set:
KFP_REPOSITORY=  # Your Kubeflow Pipelines type repository in Artifact Registry.
STAGING_BUCKET=  # Your GCS staging bucket, where Kubeflow Pipelines will persist its artifacts.
PREP_DATA_DOCKER_URI=  # URI of the data_prep component Docker image.
TRAIN_MODEL_DOCKER_URI=  # URI of the train component Docker image.
EVAL_MODEL_DOCKER_URI=  # URI of the eval component Docker image.
ONNX_OPTIMIZE_DOCKER_URI=  # URI of the onnx_optimize component Docker image.

Compile and Upload Pipeline Template

With the prerequisites met, compiling and uploading the pipeline template should be as simple as:

poetry install
poetry run pipeline
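
Under stated assumptions (names are illustrative), the pipeline entry point roughly amounts to compiling the pipeline function and uploading the result to the Kubeflow Pipelines type repository in Artifact Registry:

from kfp import compiler
from kfp.registry import RegistryClient

# Compile the pipeline function (e.g. training_pipeline from the sketch above)
# into a template file.
compiler.Compiler().compile(training_pipeline, package_path="pipeline.yaml")

# Upload the template to the Artifact Registry repository set in KFP_REPOSITORY.
client = RegistryClient(host="https://europe-kfp.pkg.dev/my-gcp-project/kfp-repo")
client.upload_pipeline(file_name="pipeline.yaml", tags=["latest"])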

Creating Pipeline Run From Your Template

To create a pipeline run from the uploaded pipeline template, just follow the official Vertex AI documentation.
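
For a programmatic alternative, a run can also be created from the uploaded template with google-cloud-aiplatform; the sketch below assumes illustrative project, repository, and bucket names:

from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="europe-west1")

job = aiplatform.PipelineJob(
    display_name="weather-training-run",
    # The template is pulled directly from the KFP repository in Artifact Registry.
    template_path="https://europe-kfp.pkg.dev/my-gcp-project/kfp-repo/ml-workflows/latest",
    pipeline_root="gs://my-staging-bucket",
    parameter_values={"data_bucket": "weather_imgs"},
)
job.submit()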


Porting Pipeline Template to Different Platforms

If the new platform offers a KFP-conformant backend, porting the KFP pipeline to another platform should be straightforward. The main modification involves adjusting the data-fetching logic in the data_prep, train, and eval components, where google.cloud.storage.Client is currently used.
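
As an illustration, the GCP-specific logic to replace has roughly the following shape (a hypothetical helper; bucket and prefix names are illustrative):

from pathlib import Path

from google.cloud import storage


def download_dataset(bucket_name: str, prefix: str, dest: Path) -> None:
    # google.cloud.storage.Client is the GCP-coupled piece; swap this helper
    # for the target platform's object-store client when porting.
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        target = dest / blob.name
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))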

Appendix

Other frameworks besides Kubeflow Pipelines are available for building machine learning workflows.
