Data Version Control Tutorial

Example repository for Data Version Control

Clone the forked repository to your computer with the git clone command

git clone git@gitlab.portofantwerp.com:data-analytics/data-science/tutorial-data-vesion-control.git

Happy coding!

Tutorial

Set up project

Set up virtual environment

pip install virtualenv
virtualenv tutorial-data-version-control
source tutorial-data-version-control/bin/activate
pip install -r requirements.txt

Make changes to the virtual environment

python -m pip install dvc scikit-learn scikit-image pandas numpy
pip freeze > requirements.txt

Download the data

curl -O https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
tar -xzf imagenette2-160.tgz
mkdir -p data/raw
mv imagenette2-160/train data/raw/train
mv imagenette2-160/val data/raw/val
rm -rf imagenette2-160
rm imagenette2-160.tgz

Set up dvc

Initialize dvc
```
dvc init
```
Create a container in the Azure storage account.

Connect to the dvc remote

dvc remote add -d remote_storage azure://tutorialpierre
dvc remote modify remote_storage account_name 'pocazureml7275011038'
dvc remote modify --local remote_storage connection_string '<connection_string>'

Add data to dvc control

 dvc add data/raw/train
 dvc add data/raw/val

Add dvc to git control

git add .
git commit -m 'feat: add dvc integration'

Push the data to the remote storage
```
dvc push
```

Manual process

First experiment

Perform prepare step

python src/prepare.py
dvc add data/prepared/train.csv data/prepared/test.csv
git add .
git commit -m "Created train and test CSV files"
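
This README does not show src/prepare.py. As a rough illustration only, a prepare step for a folder-per-class image dataset such as Imagenette could list every image together with its label in a CSV file. The function name, the column names and the assumed folder layout below are hypothetical, not the repository's actual code.

```python
import csv
from pathlib import Path


def write_file_list(image_root: Path, csv_path: Path) -> int:
    """Write one CSV row per image: file path and class label.

    Assumes a layout of image_root/<label>/<image file> (hypothetical).
    Returns the number of image rows written.
    """
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    rows = 0
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "label"])
        # Sort so the output is deterministic: the file hash tracked by DVC
        # then only changes when the data itself changes.
        for label_dir in sorted(p for p in image_root.iterdir() if p.is_dir()):
            for image in sorted(label_dir.iterdir()):
                if image.is_file():
                    writer.writerow([image.as_posix(), label_dir.name])
                    rows += 1
    return rows
```

With the paths used in this tutorial, it could be called as write_file_list(Path("data/raw/train"), Path("data/prepared/train.csv")), and likewise for data/raw/val and data/prepared/test.csv.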

Perform train step

 python src/train.py
 dvc add model/model.joblib
 git add .
 git commit -m "Trained an SGD classifier"

Perform evaluate step

 python src/evaluate.py
 git add .
 git commit -m "Evaluate the SGD model accuracy"

Push to git and dvc

git push
dvc push
git tag -a sgd-classifier -m "SGDClassifier with accuracy 67.06%"

Second experiment (100 iterations)

It is good practice to create a separate branch for every experiment.

Check out a new branch
```
git checkout -b "sgd-100-iterations"
```
Change hyperparameter in train.py
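
train.py is not reproduced in this README. The sketch below shows the kind of change meant here, assuming the script builds a scikit-learn SGDClassifier. The helper name, the parameter dictionary and the random_state value are illustrative assumptions.

```python
# Hypothetical excerpt in the spirit of src/train.py; the real script may differ.
SGD_PARAMS = {
    "max_iter": 100,    # the hyperparameter changed in this experiment
    "random_state": 0,  # assumed fixed seed so reruns are comparable
}


def build_classifier(params=None):
    """Return an untrained SGDClassifier configured with the given parameters."""
    # Imported inside the function so the parameters can be inspected
    # even where scikit-learn is not installed.
    from sklearn.linear_model import SGDClassifier

    return SGDClassifier(**(params or SGD_PARAMS))
```

Keeping hyperparameters in one place makes the Git diff between experiment branches small and easy to read.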

Execute the training and evaluation step

python src/train.py
python src/evaluate.py

Commit the changed model in dvc (stores the data in the cache and updates the .dvc files)
```
dvc commit
```

Commit the model in git

git add .
git commit -m "Change SGD max_iter to 100"

Tag and push

git tag -a sgd-100-iter -m "Trained an SGD Classifier for 100 iterations"
git push origin --tags

git push --set-upstream origin sgd-100-iterations
dvc push

Switch to the previous model
```
git checkout main
dvc checkout
```

Create a reproducible pipeline

You fetched the data manually and added it to remote storage. You can now get it with dvc checkout or dvc pull. The other steps were executed by running various Python files. These can be chained together into a single execution called a DVC pipeline that requires only one command.

Create a new branch
```
git checkout -b sgd-pipeline
```

Remove dvc tracking from the output files

dvc remove data/prepared/train.csv.dvc \
    data/prepared/test.csv.dvc \
    model/model.joblib.dvc --outs

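Create pipeline stages (dvc run is available up to DVC 2.x; DVC 3 removed it in favor of dvc stage add followed by dvc repro):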

Prepare stage

dvc run -n prepare \
    -d src/prepare.py -d data/raw \
    -o data/prepared/train.csv -o data/prepared/test.csv \
    python src/prepare.py

Train stage

dvc run -n train \
    -d src/train.py -d data/prepared/train.csv \
    -o model/model.joblib \
    python src/train.py

Evaluate stage

dvc run -n evaluate \
    -d src/evaluate.py -d model/model.joblib \
    -M metrics/accuracy.json \
    python src/evaluate.py
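
The -M flag registers metrics/accuracy.json as a metrics file, so the evaluate script has to write that file. Below is a minimal sketch of that step, assuming the script already holds the true and predicted labels. The function names and the "accuracy" key are assumptions rather than the repository's code.

```python
import json
from pathlib import Path


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def write_accuracy(value: float, path: Path = Path("metrics/accuracy.json")) -> None:
    """Store the metric as JSON so that dvc metrics show can read it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"accuracy": value}))
```

Because the metrics file is small and not cached by DVC, it is committed to Git directly, which is what allows metrics to be compared across tags later.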

Add pipeline to git

git add .
git commit -m "Rerun SGD as pipeline"
dvc commit
git push --set-upstream origin sgd-pipeline
git tag -a sgd-pipeline -m "Trained SGD as DVC pipeline."
git push origin --tags
dvc push

Make a change to the pipeline
```
git checkout -b "random_forest"
```
Change the classifier to a random forest, then detect the change
```
dvc status
```
Execute pipeline
```
dvc repro
```
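
Conceptually, DVC decides whether a stage must be rerun by comparing content hashes of its dependencies with the hashes recorded at the last run. The sketch below illustrates that idea only. It is not DVC's implementation, and the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's content."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_dependencies(deps, recorded):
    """List dependencies whose current hash differs from the recorded one.

    deps: iterable of file paths the stage depends on.
    recorded: dict mapping str(path) to the hash stored at the last run.
    """
    return [str(d) for d in deps if recorded.get(str(d)) != file_md5(Path(d))]
```

A stage with an empty result is up to date and can be skipped. Any changed dependency means the stage, and every stage that consumes its outputs, needs to run again.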

Add to git

git add .
git commit -m "Train Random Forest classifier"
dvc commit
git push --set-upstream origin random_forest
git tag -a random-forest -m "Random Forest classifier with 80.99% accuracy."
git push origin --tags
dvc push

Compare the runs
```
dvc metrics show -T
```
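
The same comparison can be approximated by hand, since the metrics file is committed to Git under each tag. The sketch below assumes the file is metrics/accuracy.json with an "accuracy" key. Both helper functions are hypothetical and not part of DVC.

```python
import json
import subprocess


def metric_at(revision: str, path: str = "metrics/accuracy.json") -> dict:
    """Read a JSON metrics file as it was committed at a Git revision."""
    text = subprocess.run(
        ["git", "show", f"{revision}:{path}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(text)


def rank_runs(metrics_by_tag: dict, key: str = "accuracy") -> list:
    """Return (tag, value) pairs sorted from best to worst."""
    pairs = [(tag, metrics[key]) for tag, metrics in metrics_by_tag.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Inside the repository, rank_runs({tag: metric_at(tag) for tag in ["sgd-classifier", "random-forest"]}) would list the tagged runs by accuracy.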

MLflow logging

Add azureml config file to the repository
Log in to Azure
```
  az login --use-device-code
```
Add dependencies (already present in the requirements.txt file)
```
 pip install mlflow azureml-core azureml-mlflow
```
Track experiment
```
  python src/log_mlflow.py
```
