DVC demo for the SE4AI 2021-22 course

This is a demo ML project used to show the main features of DVC in the 2021 edition of the Software Engineering for AI-enabled Systems course (University of Bari, Italy - Dept. of Computer Science).

The scripts used in this project are loosely based on the Kaggle tutorial "Intermediate Machine Learning". Accordingly, the example uses data from the Housing Prices Competition for Kaggle Learn Users.

Import raw data

As a first step, we imported raw data using the dvc import command:

dvc import https://github.com/collab-uniba/Software-Solutions-for-Reproducible-ML-Experiments input/home-data-for-ml-course/train.csv -o data/raw

dvc import https://github.com/collab-uniba/Software-Solutions-for-Reproducible-ML-Experiments input/home-data-for-ml-course/test.csv -o data/raw

Note that, although these data files are available on Kaggle, we took them from another public GitHub repository, Software Solutions for Reproducible ML Experiments, to demonstrate DVC's ability to import data from another repository.

Set up a Python environment

Then, we created a Python virtual environment and installed the requirements for this project, which are listed in requirements.txt:

pip install -r requirements.txt

Run the ML pipeline stages via DVC

Finally, we executed the following three dvc run commands, one for each stage of this simple ML pipeline (data preparation, model training, and model evaluation).

Data preparation stage

dvc run -n prepare \
-p prepare.train_size,prepare.test_size,prepare.random_state \
-d src/prepare.py -d data/raw/train.csv -d data/raw/test.csv \
-o data/processed/X_train.csv -o data/processed/X_valid.csv \
-o data/processed/y_train.csv -o data/processed/y_valid.csv \
python src/prepare.py
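The three parameters tracked with -p suggest that the stage performs a seeded train/validation split. The following is a minimal sketch of that idea, not the actual content of src/prepare.py (which is not shown here and most likely relies on pandas and scikit-learn). The function name split_rows and the parameter values are hypothetical; only the parameter names come from the command above.

```python
import random


def split_rows(rows, train_size, test_size, random_state):
    """Shuffle rows reproducibly and split them into train and validation parts."""
    if abs(train_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size and test_size must sum to 1")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(random_state).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical values standing in for the `prepare` section of params.yaml.
params = {"train_size": 0.8, "test_size": 0.2, "random_state": 0}
train, valid = split_rows(range(10), **params)
print(len(train), len(valid))  # 8 2
```

Because the seed is a tracked parameter, changing it in params.yaml marks the stage as outdated, while rerunning with the same seed yields the same split.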

Model training stage

dvc run -n train \
-p train.random_state,train.algorithm \
-d src/train.py -d data/processed/X_train.csv -d data/processed/y_train.csv \
-o models/iowa_model.pkl \
python src/train.py

Model evaluation stage

dvc run -n evaluate \
-d models/iowa_model.pkl -d src/evaluate.py -d data/processed/X_valid.csv -d data/processed/y_valid.csv \
-M metrics/scores.json \
python src/evaluate.py
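The -M flag declares metrics/scores.json as a metrics file that DVC tracks without caching it. As an illustration of what such a stage might produce, here is a stdlib-only sketch that computes a mean absolute error and writes it as JSON. The choice of MAE, the "mae" key, the helper names, and the sample values are assumptions; the actual src/evaluate.py may compute different scores.

```python
import json
import tempfile
from pathlib import Path


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def write_scores(path, scores):
    """Write the scores as JSON, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(scores, indent=2))


# Made-up sale prices, used only to exercise the functions.
mae = mean_absolute_error([200000, 150000], [210000, 145000])
print(mae)  # 7500.0

# A real stage would write to metrics/scores.json; a temporary folder is
# used here so that the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "metrics" / "scores.json"
    write_scores(out, {"mae": mae})
    saved = json.loads(out.read_text())
```

A JSON file written this way can then be inspected with dvc metrics show.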

Reproducing the whole pipeline

DVC automatically records the details of each stage in the dvc.yaml file.
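To give an idea of what gets recorded, the sketch below rebuilds the entry for the evaluate stage as a Python dictionary, derived only from the flags of the corresponding dvc run command. It is an approximation: the actual dvc.yaml in the repository is a YAML file and may order or format its entries differently.

```python
import json

# Hedged reconstruction of the `evaluate` entry of dvc.yaml.
evaluate_stage = {
    "cmd": "python src/evaluate.py",
    # One entry per -d flag.
    "deps": [
        "data/processed/X_valid.csv",
        "data/processed/y_valid.csv",
        "models/iowa_model.pkl",
        "src/evaluate.py",
    ],
    # -M declares a metrics file that DVC does not cache.
    "metrics": [{"metrics/scores.json": {"cache": False}}],
}
dvc_yaml = {"stages": {"evaluate": evaluate_stage}}

# Printed as JSON only to show the nesting.
print(json.dumps(dvc_yaml, indent=2))
```

The prepare and train stages follow the same shape, with an additional params list (from -p) and an outs list (from -o) in place of metrics.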

To reproduce the entire pipeline, it is sufficient to run:

dvc repro
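dvc repro reruns only the stages whose dependencies or parameters have changed since the last run, as recorded in dvc.lock. The sketch below illustrates the underlying idea with plain file hashes. It is a deliberate simplification, not DVC's implementation: the function names, the layout of the recorded dictionary, and the sample CSV content are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Hex MD5 digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def stage_is_stale(deps, recorded):
    """True if any dependency is unrecorded or its hash differs from the record."""
    return any(recorded.get(dep) != file_md5(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "train.csv"
    dep.write_text("Id,SalePrice\n1,208500\n")
    # Stand-in for the hashes stored after a successful run.
    recorded = {str(dep): file_md5(dep)}
    unchanged = stage_is_stale([str(dep)], recorded)

    # Editing the dependency invalidates the recorded hash.
    dep.write_text("Id,SalePrice\n1,181500\n")
    changed = stage_is_stale([str(dep)], recorded)

print(unchanged, changed)  # False True
```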

The scripts from this repo are also available as a GitHub Gist.
