
DIY Data Version Control (DVC)

Goal: Learn how to use DVC for data versioning, pipeline management, experimentation, and more. Get practical experience by applying it to your own project.
Dates: from 22nd May to 28th May.
Where: #project-of-the-week in DataTalks.Club (join Slack here: https://datatalks.club/slack.html)

For more information about the "Project of the Week" initiative at DataTalks.Club, see README.md.

If you want to receive reminders about this event, sign up here

Technologies

VS Code
Python
Git

Note: this is a suggested list of technologies; you can choose alternatives instead.

Plan

This is only a proposed plan; you don’t have to follow it day by day.

Day 1 (22 May, Wednesday)

Create a GitHub repository.
Install DVC on your operating system by following [2].
Learn and try out DVC's data versioning capabilities with your own data (for example, a txt or csv file); check out [1] and [3]. Friendly reminder: DVC is not just a version control tool; it also covers experimentation, pipelines, and more.
Share your progress in Slack and on social media.
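
To get a feel for what data versioning keeps track of, here is a minimal sketch that writes a tiny CSV file and fingerprints it with an MD5 hash. DVC records a content hash of each tracked file in its .dvc file, so a changed file shows up as a new version. The file name data/sample.csv and the helper names are hypothetical, and this only illustrates the idea; it is not how DVC itself is implemented.

```python
import csv
import hashlib
from pathlib import Path


def write_sample_csv(path, rows):
    """Write a small CSV file that could later be tracked with `dvc add`."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows(rows)
    return path


def content_hash(path):
    """Return the MD5 hex digest of the file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


if __name__ == "__main__":
    # Hypothetical file name and made-up rows, for illustration only.
    csv_path = write_sample_csv("data/sample.csv", [(1, 10), (2, 20)])
    print("version 1:", content_hash(csv_path))

    # Changing the content changes the hash, i.e. it is a new data version.
    write_sample_csv(csv_path, [(1, 10), (2, 20), (3, 30)])
    print("version 2:", content_hash(csv_path))
```

After generating the file, running dvc add on it and committing the resulting .dvc file to git is the usual way to record that version.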

Suggested material:

Found good materials? Create a PR with links!

Day 2 (23 May, Thursday)

Create an ML project pipeline with processing, training, and evaluation steps. For dataset ideas, check the first link in the suggested materials [1]. I suggest using small datasets and lightweight libraries (sklearn and datasets); remember, the goal is to explore and learn the tool.
For ideas on how to split your ML pipeline, check the official example [2]. I also made a simple ML pipeline with a random forest on the iris data that you can copy [3].
Create a params.yml that stores the important parameters for the processing and training steps of your ML pipeline (DVC's default parameters file name is params.yaml; a different file name has to be given explicitly when you declare parameters). Check these examples: the official example [4] and my simpler example [5].
Push your changes to GitHub.
Share your progress in Slack and on social media.

Note that the pipelines.py scripts do not depend on the DVC library (only the evaluation part does, so you can skip it for now).
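
As a rough illustration of the three-step split, the sketch below uses only the Python standard library so it runs anywhere. It swaps in a hand-rolled k-nearest-neighbours classifier and a tiny made-up dataset in place of sklearn and a real dataset. The PARAMS dictionary stands in for the contents of your parameters file. All names and values here are hypothetical.

```python
import random
from collections import Counter

# Stand-in for the values you would keep in your parameters file (hypothetical keys).
PARAMS = {
    "process": {"test_ratio": 0.25, "seed": 42},
    "train": {"k": 3},
}

# Tiny made-up dataset of (features, label) pairs.
DATA = [
    ((1.0, 1.2), "a"), ((0.8, 1.0), "a"), ((1.1, 0.9), "a"), ((0.9, 1.1), "a"),
    ((3.0, 3.2), "b"), ((3.1, 2.9), "b"), ((2.8, 3.0), "b"), ((3.2, 3.1), "b"),
]


def process(data, test_ratio, seed):
    """Processing step: shuffle and split the data into train and test rows."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]


def train(train_rows, k):
    """Training step: a k-NN 'model' only needs to store its training rows."""
    return {"k": k, "rows": list(train_rows)}


def predict(model, features):
    """Predict the majority label among the k closest training rows."""
    by_distance = sorted(
        model["rows"],
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    votes = Counter(label for _, label in by_distance[: model["k"]])
    return votes.most_common(1)[0][0]


def evaluate(model, test_rows):
    """Evaluation step: return accuracy on the held-out rows."""
    correct = sum(predict(model, features) == label for features, label in test_rows)
    return {"accuracy": correct / len(test_rows)}


if __name__ == "__main__":
    train_rows, test_rows = process(DATA, **PARAMS["process"])
    model = train(train_rows, **PARAMS["train"])
    print(evaluate(model, test_rows))
```

Keeping each step as a separate function (or a separate script) that takes its parameters as arguments makes it easier to turn the steps into DVC pipeline stages later.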

Suggested material:

💾 List of Dataset/Project ideas
💻 DVC Official ml pipeline example
💻 My GitHub’s simpler ml pipeline example with iris dataset
💻 Params.yml from DVC example
💻 Params.yml more simplified

Found good materials? Create a PR with links!

Day 3 (24 May, Friday)

Finish preparing your own project if you haven’t done so already.
Version your dataset with DVC: [1], [2]
To configure storage, check [3]. If you don’t want to spend time setting up remote storage, you can use a local folder instead.
Play with DVC commands such as dvc status, dvc add, dvc push, and dvc checkout (together with your git commands).
Push your changes to GitHub.
Share your progress in Slack and on social media.
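
One possible order for the commands above, with a local folder as the remote, is sketched here as a small Python helper that prints the commands and only runs them when asked. The data path, the remote name localremote, the storage folder, and the commit messages are all hypothetical. The sketch assumes you are already inside a git repository, since dvc init expects one by default.

```python
import shlex
import subprocess
from pathlib import PurePosixPath


def versioning_commands(data_path, remote_dir, remote_name="localremote"):
    """Build a git/DVC command sequence for tracking one file (paths are examples)."""
    # `dvc add` writes a .gitignore next to the tracked file.
    gitignore = str(PurePosixPath(data_path).parent / ".gitignore")
    return [
        ["dvc", "init"],
        ["git", "commit", "-m", "Initialize DVC"],
        # -d makes this remote the default one; a plain folder works as storage.
        ["dvc", "remote", "add", "-d", remote_name, remote_dir],
        ["dvc", "add", data_path],
        ["git", "add", f"{data_path}.dvc", gitignore, ".dvc/config"],
        ["git", "commit", "-m", "Track dataset with DVC"],
        ["dvc", "push"],
        ["dvc", "status"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical data file and storage folder; dry run only prints the commands.
    run(versioning_commands("data/data.csv", "/tmp/dvc-storage"))
```

To go back to an earlier data version, check out the older commit with git and then run dvc checkout so the data files match the .dvc files of that commit.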

Suggested material:

Found good materials? Create a PR with links!

Day 4 (25 May, Saturday)

Try to build pipelines based on the official documentation [1]
You can follow the official video tutorial [2]. The video is from 3 years ago and some commands might have changed, so keep the docs [1] open in parallel.
Check out the summary in [3] to understand what this DVC feature solves.
Push your changes to GitHub.
Share your progress in Slack and on social media.

Note: It is important to have your params.yml ready and the dependencies declared correctly in the dvc stage add flags. Post your dvc dag if you want 🙂
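
To see why correct dependencies matter, it can help to think of the pipeline as a graph: a stage has to run after every stage that produces one of its dependencies. The sketch below derives such an order from declared dependencies and outputs using Python's graphlib (Python 3.9+). The stage names and file paths are hypothetical, and this only illustrates the concept; it is not DVC's own code.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: what each one reads (deps) and writes (outs).
STAGES = {
    "process": {
        "deps": ["data/raw.csv", "src/process.py"],
        "outs": ["data/processed.csv"],
    },
    "train": {
        "deps": ["data/processed.csv", "src/train.py"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "deps": ["model.pkl", "data/processed.csv", "src/evaluate.py"],
        "outs": ["metrics.json"],
    },
}


def stage_graph(stages):
    """Map each stage to the set of stages whose outputs it depends on."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


def execution_order(stages):
    """Return the stage names in an order that respects the dependencies."""
    return list(TopologicalSorter(stage_graph(stages)).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(STAGES)))
```

If a dependency is missing from a stage, the link between the two stages disappears from the graph, and a change upstream no longer marks that stage as needing a rerun.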

Suggested material:

Found good materials? Create a PR with links!

Day 5 (26 May, Sunday)

Create an evaluation.py script that prints and plots metrics from your ML model.
Install DVCLive and adapt its library functions to your script, as in the official tutorial [1].
Check out the official hands-on tutorial: [2]
Compare the code committed to git with your local changes (in terms of params, metrics, and plots).
Push your changes to GitHub.
Share your progress in Slack and on social media.
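
A minimal sketch of what an evaluation script might look like is shown below, covering only the metric side. It computes a few binary-classification metrics by hand, writes them to a JSON file, and optionally logs them through DVCLive. The function names, the metrics.json file name, and the example labels are hypothetical. The DVCLive part uses Live and log_metric and is imported lazily so the rest runs without the package installed. Plotting is left to the official tutorial.

```python
import json
from pathlib import Path


def compute_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def save_metrics(metrics, path="metrics.json"):
    """Write the metrics to a JSON file that a pipeline stage could output."""
    Path(path).write_text(json.dumps(metrics, indent=2))


def log_with_dvclive(metrics):
    """Log the same metrics through DVCLive (requires `pip install dvclive`)."""
    from dvclive import Live  # lazy import: only needed for this function

    with Live() as live:
        for name, value in metrics.items():
            live.log_metric(name, value)


if __name__ == "__main__":
    # Made-up labels and predictions, for illustration only.
    metrics = compute_metrics([1, 0, 1, 1], [1, 0, 0, 1])
    print(metrics)
    save_metrics(metrics)
```

In your own script, y_true and y_pred would come from your test set and trained model rather than from hard-coded lists.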

Suggested material:

🗒️ Official doc tutorial for metric parameter plots
📺 Hands-on tutorial

Found good materials? Create a PR with links!

Day 6 (27 May, Monday)

Continue developing yesterday’s task.
Change some of the parameters in params.yml and run the pipeline. To compare those ‘experiments’, you can use this reference [1].
Run and compare different experiments using [2].
Push your changes to GitHub.
Share your progress in Slack and on social media.
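
Comparing experiments comes down to asking which parameters changed and how the metrics moved. The sketch below shows that idea for two runs stored as nested dictionaries. The parameter names and metric values are made up for illustration, and the helpers are hypothetical. DVC provides this kind of comparison itself through commands such as dvc params diff, dvc metrics diff, and dvc exp show.

```python
def flatten(d, prefix=""):
    """Flatten nested dicts into dotted keys: {'train': {'k': 3}} -> {'train.k': 3}."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def diff(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        key: (old_flat.get(key), new_flat.get(key))
        for key in sorted(old_flat.keys() | new_flat.keys())
        if old_flat.get(key) != new_flat.get(key)
    }


if __name__ == "__main__":
    # Made-up parameters and metrics for two hypothetical runs.
    baseline = {"params": {"train": {"k": 3, "seed": 42}}, "metrics": {"accuracy": 0.90}}
    candidate = {"params": {"train": {"k": 5, "seed": 42}}, "metrics": {"accuracy": 0.92}}

    for key, (before, after) in diff(baseline, candidate).items():
        print(f"{key}: {before} -> {after}")
```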

Suggested material:

🗒️ Official doc tutorial for tracking
🗒️ Comparing experiments

Found good materials? Create a PR with links!

Day 7 (28 May, Tuesday)

Polish the documentation for your project.
Continue exploring this topic: check out the hands-on tutorials on sharing data and models [1] and on experiments [2].
Push your changes to GitHub.
Share your progress in Slack and on social media.
Give us feedback.
Add the link to your project to this project of the week GitHub page.

Suggested material:

Projects

List of projects from our participants:

...
(Create a PR)

(We will put the projects here after the event finishes)