This repository showcases an implementation of a model training pipeline. Although this case study uses Kubeflow Pipelines (KFP), Google Cloud Platform (GCP), and a weather image classification dataset, the focus is on demonstrating key concepts such as reproducibility, artifact tracking, and automation, rather than the specifics of particular tools and implementation details.
The ml-workflows repository follows the standard project structure and tooling proposed in the py-manage repository. An extensive write-up and discussion of the approaches taken here can be found in the article "ML Training Pipelines".
- Hypothetical Problem Statement
- Pipeline Overview
- Artifact Examples
- Prerequisites
- Compile and Upload Pipeline Template
- Creating Pipeline Run From Your Template
- Porting Pipeline Template to Different Platforms
- Appendix
Let’s assume we are running a renewable energy company that seeks to optimize solar and wind farm operations across diverse geographic locations. By implementing an AI system that can automatically recognize weather conditions from images captured by on-site cameras, we can predict energy output more accurately and adjust operations in real-time. This weather recognition capability would enable more efficient resource allocation and improve overall energy production forecasting.
For this problem, we've acquired a “Weather Image Recognition” dataset as an initial dataset that we believe will meet our needs. Our goal is to create a model capable of predicting 11 distinct weather conditions: dew, fog/smog, frost, glaze, hail, lightning, rain, rainbow, rime, sandstorm, and snow. This diverse range of weather phenomena will allow our AI system to provide comprehensive insights for optimizing our renewable energy operations.
The aim of our project is to develop a robust model training pipeline that researchers and engineers can easily reuse with different runtime parameters. It should accommodate varying data sources (if somebody decides to enhance the initial dataset), data splits, random seeds, training epochs, etc. The pipeline should guarantee reproducibility and ease of artifact tracking, as well as a high level of automation.
The pipeline consists of four main components:
- Data Preparation (`data_prep`):
  - Splits the dataset into train, validation, and test sets.
  - Uses stratified sampling to maintain class distribution.
  - Outputs: train, validation, and test split information.
- Model Training (`train`):
  - Implements reproducibility measures.
  - Uses the MobileNetV3-Small architecture with transfer learning.
  - Fine-tunes the classifier head for the specific problem domain.
  - Outputs: trained PyTorch model, ONNX model, ONNX model with transformations included as part of the model graph, training metrics, and loss plot.
- Model Evaluation (`eval`):
  - Evaluates the model on the test set.
  - Calculates evaluation metrics.
  - Outputs: confusion matrix, weighted precision, recall, and F1-scores.
- ONNX Model Optimization (`onnx_optimize`):
  - A simple component that loads the ONNX model with transformations, optimizes it, and produces an optimized model artifact.
Each component has its own Docker image, ensuring a reproducible runtime environment through the use
of Poetry and its .lock
files. To guarantee consistency in results, all components
employ fixed random seeds for randomized algorithms. Additionally, the training component includes
extra PyTorch reproducibility measures.
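These measures can be sketched as a single helper called at the start of each component. The function name is hypothetical; the torch-specific switches are the standard PyTorch determinism knobs:

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Fix random seeds across the libraries a component may use (sketch)."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        # Extra measures used only by the training component.
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        # May additionally require CUBLAS_WORKSPACE_CONFIG to be set on GPU.
        torch.use_deterministic_algorithms(True)
    except ImportError:
        pass
```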
By default, component result caching is enabled, meaning components are not re-executed unless their input parameters or the components themselves are modified. For example, this would allow for adjustments to the evaluation code without the need for time-consuming retraining.
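Conceptually, the cache key is derived from the component's definition and its resolved inputs, so changing either invalidates only the affected steps. A toy illustration of that idea (not KFP's actual implementation):

```python
import hashlib
import json

_cache: dict = {}


def run_with_cache(component_name: str, component_spec: str, inputs: dict, fn):
    """Re-execute fn only when the component spec or its inputs change."""
    key_material = json.dumps(
        {"name": component_name, "spec": component_spec, "inputs": inputs},
        sort_keys=True,
    )
    key = hashlib.sha256(key_material.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = fn(**inputs)  # cache miss: actually run the component
    return _cache[key]
```

In the real pipeline, caching can be disabled per task with `task.set_caching_options(False)` or for a whole run with `enable_caching=False` on the Vertex AI pipeline job.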
If we add pipeline runs of interest to Vertex AI Experiment, we can get side-by-side comparisons for free:
The implementation of this pipeline template and its pipeline runs is coupled with GCP. To compile and upload the pipeline template, as well as to create a pipeline run from it, one has to:
- Have a GCP project.
- Enable Vertex AI API.
- Have a GCS bucket with the "Weather Image Recognition" dataset uploaded. In code, this is the `data_bucket` parameter to the pipeline, which defaults to `"weather_imgs"`. Use your bucket name here.
- Have a GCS staging bucket, where Kubeflow Pipelines can persist its artifacts.
- Have a Docker type repository in Artifact Registry, where built images of components are pushed.
- Have a Kubeflow Pipelines type repository in Artifact Registry, where templates of pipelines are pushed.
- Have the gcloud CLI installed.
- Correctly configure and authorize `gcloud`:

  ```shell
  gcloud config set project $GCP_PROJECT
  gcloud auth application-default login
  ```
- Have a local `.env` file with the following `env` variables set:

  ```shell
  KFP_REPOSITORY= # Your Kubeflow Pipelines type repository in Artifact Registry.
  STAGING_BUCKET= # Your GCS bucket, where Kubeflow Pipelines will persist its artifacts.
  PREP_DATA_DOCKER_URI= # Your URI of the data_prep component Docker image.
  TRAIN_MODEL_DOCKER_URI= # Your URI of the train component Docker image.
  EVAL_MODEL_DOCKER_URI= # Your URI of the eval component Docker image.
  ONNX_OPTIMIZE_DOCKER_URI= # Your URI of the onnx_optimize component Docker image.
  ```
- Have Poetry installed.
- Have Python `~3.12` active in the project directory (using pyenv is advised).
With the prerequisites met, compiling and uploading the pipeline template should be as simple as:
```shell
poetry install
poetry run pipeline
```
To create a pipeline run from the uploaded pipeline template, just follow the official Vertex AI documentation.
If the new platform offers a KFP-conformant backend, porting the pipeline should be straightforward. The main modification involves adjusting the data-fetching logic in the `data_prep`, `train`, and `eval` components, where `google.cloud.storage.Client` is currently used.
Other frameworks for building machine learning workflows besides Kubeflow Pipelines: