Example repository for Data Version Control
- Clone the forked repository to your computer with the git clone command
git clone git@gitlab.portofantwerp.com:data-analytics/data-science/tutorial-data-vesion-control.git
- Set up the virtual environment
pip install virtualenv
virtualenv tutorial-data-version-control
source tutorial-data-version-control/bin/activate
pip install -r requirements.txt
- Install the project dependencies into the virtual environment
python -m pip install dvc scikit-learn scikit-image pandas numpy
pip freeze > requirements.txt
- Download the data
curl -O https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
tar -xzf imagenette2-160.tgz
mv imagenette2-160/train data/raw/train
mv imagenette2-160/val data/raw/val
rm -rf imagenette2-160
rm imagenette2-160.tgz
- Initialize dvc
dvc init
- Create a container in the Azure storage account.
- Connect to the dvc remote
dvc remote add -d remote_storage azure://tutorialpierre
dvc remote modify remote_storage account_name 'pocazureml7275011038'
dvc remote modify --local remote_storage connection_string '<connection_string>'
- Add data to dvc control
dvc add data/raw/train
dvc add data/raw/val
- Add dvc to git control
git add .
git commit -m 'feat: add dvc integration'
- Push the data to the remote storage
dvc push
- Perform the prepare step
python src/prepare.py
dvc add data/prepared/train.csv data/prepared/test.csv
git add .
git commit -m "Created train and test CSV files"
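The contents of `src/prepare.py` are not included here. A minimal sketch of what such a script might do, assuming the Imagenette layout `data/raw/<split>/<class>/<image>` and that the prepared CSVs map image paths to labels; the function names and CSV columns are illustrative, not the tutorial's actual code:

```python
# Hypothetical sketch of src/prepare.py: walk the raw image folders and
# write a CSV of (filename, label) pairs per split. The class label is
# taken from the parent folder name, as in the Imagenette layout.
import csv
from pathlib import Path


def build_rows(split_dir: Path) -> list[tuple[str, str]]:
    """Return (image_path, label) pairs; the label is the parent folder name."""
    rows = []
    for image in sorted(split_dir.glob("*/*")):
        if image.is_file():
            rows.append((str(image), image.parent.name))
    return rows


def write_csv(rows: list[tuple[str, str]], out_path: Path) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "label"])
        writer.writerows(rows)


if __name__ == "__main__":
    for split, out_name in [("train", "train.csv"), ("val", "test.csv")]:
        rows = build_rows(Path("data/raw") / split)
        write_csv(rows, Path("data/prepared") / out_name)
```

Because the CSVs are derived files, they are added to DVC (not git) in the step above.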
- Perform the train step
python src/train.py
dvc add model/model.joblib
git add .
git commit -m "Trained an SGD classifier"
- Perform the evaluate step
python src/evaluate.py
git add .
git commit -m "Evaluate the SGD model accuracy"
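A sketch of what `src/evaluate.py` might look like. The metrics file name `metrics/accuracy.json` matches the `-M` flag of the DVC evaluate stage later in this README; the rest (function names, the hard-coded example labels) is an assumption:

```python
# Hypothetical sketch of src/evaluate.py: score the saved model on the
# test split and write the accuracy to metrics/accuracy.json so DVC can
# track it as a metric.
import json
from pathlib import Path


def accuracy(y_true, y_pred) -> float:
    """Fraction of matching labels (same value scikit-learn's accuracy_score gives)."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def write_metrics(acc: float, path: str = "metrics/accuracy.json") -> None:
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"accuracy": acc}))


if __name__ == "__main__":
    # The real script would get predictions from model/model.joblib;
    # these labels are placeholders so the sketch runs on its own.
    y_true = [0, 1, 1, 0]
    y_pred = [0, 1, 0, 0]
    write_metrics(accuracy(y_true, y_pred))
```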
- Push to git and dvc
git push
dvc push
git tag -a sgd-classifier -m "SGDClassifier with accuracy 67.06%"
It is good practice to create a separate branch for every experiment.
- Check out a new branch
git checkout -b "sgd-100-iterations"
- Change the hyperparameter in train.py
- Execute the training and evaluation steps
python src/train.py
python src/evaluate.py
- Commit the changed model in dvc (stores the data in the cache and updates the .dvc files)
dvc commit
- Commit the model in git
git add .
git commit -m "Change SGD max_iter to 100"
- Tag and push
git tag -a sgd-100-iter -m "Trained an SGD Classifier for 100 iterations"
git push origin --tags
git push --set-upstream origin sgd-100-iterations
dvc push
- Switch back to the previous model
git checkout main
dvc checkout
You fetched the data manually and added it to remote storage. You can now get it with dvc checkout or dvc pull. The other steps were executed by running various Python files. These can be chained together into a single execution called a DVC pipeline that requires only one command.
- Create a new branch
git checkout -b sgd-pipeline
- Remove dvc tracking from the output files
dvc remove data/prepared/train.csv.dvc \
    data/prepared/test.csv.dvc \
    model/model.joblib.dvc --outs
- Create the pipeline stages:
- Prepare stage
dvc run -n prepare \
    -d src/prepare.py -d data/raw \
    -o data/prepared/train.csv -o data/prepared/test.csv \
    python src/prepare.py
- Train stage
dvc run -n train \
    -d src/train.py -d data/prepared/train.csv \
    -o model/model.joblib \
    python src/train.py
- Evaluate stage
dvc run -n evaluate \
    -d src/evaluate.py -d model/model.joblib \
    -M metrics/accuracy.json \
    python src/evaluate.py
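Each `dvc run` call records its stage in a `dvc.yaml` file at the repository root. Based on the three commands above, the generated file should look roughly like this (the exact layout can vary with the DVC version):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw
    outs:
      - data/prepared/train.csv
      - data/prepared/test.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/train.csv
    outs:
      - model/model.joblib
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - model/model.joblib
    metrics:
      - metrics/accuracy.json:
          cache: false
```

The `cache: false` entry corresponds to the `-M` flag, which tells DVC to track the metrics file in git rather than in its cache.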
- Add the pipeline to git
git add .
git commit -m "Rerun SGD as pipeline"
dvc commit
git push --set-upstream origin sgd-pipeline
git tag -a sgd-pipeline -m "Trained SGD as DVC pipeline."
git push origin --tags
dvc push
- Make a change to the pipeline
git checkout -b random-forest
- Change the classifier to a random forest, then detect the change
dvc status
- Execute the pipeline
dvc repro
- Add to git
git add .
git commit -m "Train Random Forest classifier"
dvc commit
git push --set-upstream origin random-forest
git tag -a random-forest -m "Random Forest classifier with 80.99% accuracy."
git push origin --tags
dvc push
- Compare the runs
dvc metrics show -T
- Add azureml config file to the repository
- Log in to Azure
az login --use-device-code
- Add dependencies (already present in the requirements.txt file)
pip install mlflow azureml-core azureml-mlflow
- Track experiment
python src/log_mlflow.py
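`src/log_mlflow.py` is not shown either. A sketch of what such a script might do, under clear assumptions: the experiment name, parameter values, and metrics path are placeholders, and when mlflow/azureml-core are not installed (or no workspace config is found) the run record falls back to a local JSON file so the sketch still runs:

```python
# Hypothetical sketch of src/log_mlflow.py: point MLflow's tracking at the
# Azure ML workspace (via the azureml config file added earlier) and log
# the run's parameters and metrics there.
import json
from pathlib import Path


def collect_run(params: dict, metrics_path: str = "metrics/accuracy.json") -> dict:
    """Gather what to log: hyperparameters plus the DVC metrics file if present."""
    metrics = {"accuracy": 0.0}
    p = Path(metrics_path)
    if p.exists():
        metrics = json.loads(p.read_text())
    return {"params": params, "metrics": metrics}


def log_run(run: dict) -> str:
    try:
        import mlflow
        from azureml.core import Workspace  # reads the azureml config file

        ws = Workspace.from_config()
        mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
        mlflow.set_experiment("tutorial-data-version-control")
        with mlflow.start_run():
            mlflow.log_params(run["params"])
            mlflow.log_metrics(run["metrics"])
        return "mlflow"
    except Exception:
        # mlflow/azureml not installed or no workspace configured:
        # keep a plain JSON record instead so nothing is lost.
        Path("mlruns_local.json").write_text(json.dumps(run))
        return "local"


if __name__ == "__main__":
    run = collect_run({"model": "SGDClassifier", "max_iter": 10})
    log_run(run)
```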