DVC demo for the Software Engineering for AI-enabled Systems course (2021).

se4ai2122-cs-uniba/SE4AI2021Course_DVC-demo

DVC demo for the SE4AI 2021-22 course

This is a demo ML project used to show the main features of DVC in the 2021 edition of the Software Engineering for AI-enabled Systems course (University of Bari, Italy - Dept. of Computer Science).

The scripts used in this project are loosely based on the Kaggle tutorial "Intermediate Machine Learning". Accordingly, the example uses data from the Housing Prices Competition for Kaggle Learn Users.

Import raw data

As a first step, we imported raw data using the dvc import command:

dvc import https://github.com/collab-uniba/Software-Solutions-for-Reproducible-ML-Experiments input/home-data-for-ml-course/train.csv -o data/raw

dvc import https://github.com/collab-uniba/Software-Solutions-for-Reproducible-ML-Experiments input/home-data-for-ml-course/test.csv -o data/raw

Although these data files are available on Kaggle, we imported them from another public GitHub repository, Software Solutions for Reproducible ML Experiments, to demonstrate that dvc import can pull data from an external repository.

Set up a Python environment

Then, we created a Python (virtual) environment and installed the requirements for this project, which are listed in requirements.txt.

pip install -r requirements.txt

Run the ML pipeline stages via DVC

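Finally, we executed the following three dvc run commands, one for each stage of this simple ML pipeline (data preparation, model training, and model evaluation). Note that dvc run was removed in DVC 3.0; with newer versions, dvc stage add accepts the options used here, and dvc repro then executes the stages.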

Data preparation stage

dvc run -n prepare \
-p prepare.train_size,prepare.test_size,prepare.random_state \
-d src/prepare.py -d data/raw/train.csv -d data/raw/test.csv \
-o data/processed/X_train.csv -o data/processed/X_valid.csv \
-o data/processed/y_train.csv -o data/processed/y_valid.csv \
python src/prepare.py
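
The contents of src/prepare.py are not shown in this README. As a rough illustration of how the prepare.* parameters could drive a reproducible split, here is a minimal standard-library sketch. The helper split_rows and the parameter values are hypothetical, prepare.test_size is left out for brevity, and the actual script may well rely on scikit-learn's train_test_split instead.

```python
import random


def split_rows(rows, train_size, random_state):
    """Shuffle rows reproducibly and split them into train and validation parts.

    Hypothetical helper: train_size is a fraction in (0, 1); the remaining
    rows become the validation set.
    """
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(random_state).shuffle(shuffled)  # seeded, hence repeatable
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Assumed values; in the project they would be read from params.yaml,
# the file in which DVC looks up parameters passed with -p by default.
params = {"prepare": {"train_size": 0.8, "random_state": 0}}

rows = list(range(10))  # stand-in for the rows of data/raw/train.csv
train, valid = split_rows(rows, **params["prepare"])
print(len(train), len(valid))
```

Because the shuffle is seeded with random_state, rerunning the stage with unchanged parameters yields the same split, which is what lets DVC treat the outputs as a function of the declared dependencies and parameters.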

Model training stage

dvc run -n train \
-p train.random_state,train.algorithm \
-d src/train.py -d data/processed/X_train.csv -d data/processed/y_train.csv \
-o models/iowa_model.pkl \
python src/train.py
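
To make the role of train.algorithm and of the models/iowa_model.pkl output more concrete, the following sketch selects a model by name and pickles it. Everything here is an assumption for illustration: the ALGORITHMS registry, the constant-prediction baselines, and the train function are hypothetical and do not reflect the actual src/train.py; train.random_state is not used because these baselines are deterministic.

```python
import pickle
import statistics
import tempfile
from pathlib import Path

# Hypothetical registry: maps a train.algorithm value to a "fit" function
# returning the constant that a baseline model would predict.
ALGORITHMS = {
    "mean": statistics.mean,
    "median": statistics.median,
}


def train(y_train, algorithm, model_path):
    """Fit the selected baseline and pickle it to model_path."""
    model = {"algorithm": algorithm, "prediction": ALGORITHMS[algorithm](y_train)}
    path = Path(model_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create models/ if missing
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return model


# A temporary directory stands in for the repository root.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "iowa_model.pkl"
    train([100, 200, 600], algorithm="median", model_path=model_path)
    with model_path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored)
```

Declaring the pickle file with -o tells DVC to cache it and to treat it as the input that connects this stage to the evaluation stage.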

Model evaluation stage

dvc run -n evaluate \
-d models/iowa_model.pkl -d src/evaluate.py -d data/processed/X_valid.csv -d data/processed/y_valid.csv \
-M metrics/scores.json \
python src/evaluate.py
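
The -M option declares metrics/scores.json as a metrics file that DVC does not cache, so it can be committed to Git and inspected with dvc metrics show. The sketch below shows one way an evaluation script might produce such a file. The choice of mean absolute error and the "mae" key are assumptions (the README does not say which metric src/evaluate.py reports), and both helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def write_scores(scores, path):
    """Write a flat dict of metric names and values as JSON."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create metrics/ if missing
    path.write_text(json.dumps(scores, indent=2))


# A temporary directory stands in for the repository root.
with tempfile.TemporaryDirectory() as tmp:
    scores_path = Path(tmp) / "metrics" / "scores.json"
    mae = mean_absolute_error([100, 200, 300], [110, 180, 330])
    write_scores({"mae": mae}, scores_path)
    saved = json.loads(scores_path.read_text())

print(saved)
```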

Reproducing the whole pipeline

The details about each stage are automatically stored by DVC in the dvc.yaml file.

To reproduce the entire pipeline, it is sufficient to run:

dvc repro
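
The execution order follows from the declared dependencies and outputs: a stage must run after every stage that produces one of its dependencies. The following is a conceptual sketch of that idea using Python's graphlib (Python 3.9+), with the paths copied from the three commands above. It is not DVC's actual implementation, and it ignores change detection, which is what allows dvc repro to skip stages whose inputs are unchanged.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Dependencies and outputs of each stage, as declared with -d, -o, and -M.
# The metrics file is listed under "outs" here for simplicity.
stages = {
    "prepare": {
        "deps": ["src/prepare.py", "data/raw/train.csv", "data/raw/test.csv"],
        "outs": [
            "data/processed/X_train.csv",
            "data/processed/X_valid.csv",
            "data/processed/y_train.csv",
            "data/processed/y_valid.csv",
        ],
    },
    "train": {
        "deps": [
            "src/train.py",
            "data/processed/X_train.csv",
            "data/processed/y_train.csv",
        ],
        "outs": ["models/iowa_model.pkl"],
    },
    "evaluate": {
        "deps": [
            "models/iowa_model.pkl",
            "src/evaluate.py",
            "data/processed/X_valid.csv",
            "data/processed/y_valid.csv",
        ],
        "outs": ["metrics/scores.json"],
    },
}

# Map every output path to the stage that produces it.
producer = {out: name for name, stage in stages.items() for out in stage["outs"]}

# A stage depends on each stage that produces one of its dependencies.
graph = {
    name: {producer[dep] for dep in stage["deps"] if dep in producer}
    for name, stage in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)
```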

The scripts from this repo are also available as a GitHub Gist.
