!!! info "Core Module"

In this module, we are going to return to version control. However, this time we are going to focus on version control
of data. The reason we need to separate between standard version control and data version control comes down to one
problem: size.

Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that
contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the
other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better the more
data you feed them, we are seeing models today that are being trained on petabytes of data (1,000,000 GB).

Because this is an important concept, there exist a couple of frameworks that specialize in versioning data, such as
[DVC](https://dvc.org/), [DAGsHub](https://dagshub.com/), [Hub](https://www.activeloop.ai/),
[Modelstore](https://modelstore.readthedocs.io/en/latest/) and [ModelDB](https://github.com/VertaAI/modeldb/).
Regardless of the framework, they all implement roughly the same concept: instead of storing the actual data files,
or in general any large *artifact* files, we store a pointer to these large files. We then version
control the pointer instead of the artifact.

In this course we are going to use `DVC` provided by [iterative.ai](https://iterative.ai/) as they also provide tools
for automating machine learning, which we are going to focus on later.
## DVC: What is it?
DVC (Data Version Control) is simply an extension of `git` that versions not only data but also models and
experiments in general. But how does it deal with these large data files? Essentially, `DVC` will just keep track of a
small *metafile* that will then point to some remote location where your original data is stored. Metafiles
essentially work as placeholders for your data files. Your large data files are then stored in some remote location such
as Google Drive or an `S3` bucket from Amazon.
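To make this concrete, a metafile such as `data.dvc` is just a small, human-readable YAML file; the hash, sizes and path below are illustrative placeholders, not real values:

```yaml
# Illustrative contents of a metafile like data.dvc
# (hash, size and nfiles are placeholders).
outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6.dir
  size: 123456789
  nfiles: 42
  path: data
```

This tiny file is all that `git` needs to track; the actual content is looked up in the remote storage by its hash.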

<figure markdown>
![Image](../figures/dvc.png){ width="700" }
</figure>
As the figure shows, we now have two remote locations: one for code and one for data. We use `git pull/push` for the
code and `dvc pull/push` for the data. The key concept is the connection between the data file `model.pkl`, which is
fairly large, and its respective *metafile* `model.pkl.dvc`, which is very small. The large file is stored in the data
remote and the metafile is stored in the code remote.
## ❔ Exercises

If in doubt about some of the exercises, we recommend checking out the [documentation for DVC](https://dvc.org/doc) as
it contains excellent tutorials.

1. For these exercises, we are going to use [Google Drive](https://www.google.com/intl/da/drive/) as a remote storage
    solution for our data. If you do not already have a Google account, please create one (we are going to use it again
    in later exercises). Please make sure that you have at least 1GB of free space.

2. Next, install DVC and the Google Drive extension:

    ```bash
    pip install dvc
    pip install "dvc[gdrive]"
    ```
3. Run `dvc init` inside your repository; this will set up `dvc` for this repository (similar to how `git init` will
    initialize a git repository). These files should be committed using standard `git` to your repository.
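    Initializing and committing can be sketched as follows (the commit message is illustrative; `dvc init` creates the
    `.dvc/` folder and a `.dvcignore` file):

    ```bash
    dvc init
    git add .dvc .dvcignore
    git commit -m "Initialize dvc"  # illustrative message
    ```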

4. Go to your Google Drive and create a new folder called `dtu_mlops_data`. Then copy the unique identifier
    belonging to that folder from the folder's URL and add it as remote storage:
    ```bash
    dvc remote add -d storage gdrive://<your_identifier>
    ```

5. Check the content of the file `.dvc/config`. Does it contain a pointer to your remote storage? Afterwards, make sure
    to add this file to the next commit we are going to make:

    ```bash
    git add .dvc/config
    ```
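    After adding the remote, the config should look roughly like the following (the folder identifier here is an
    illustrative placeholder, not a real value):

    ```ini
    ; Illustrative .dvc/config after `dvc remote add -d storage gdrive://<your_identifier>`
    [core]
        remote = storage
    ['remote "storage"']
        url = gdrive://1A2b3C4d5E6f7G8h9I0j
    ```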
6. Call the `dvc add` command on your data files exactly like you would add a file with `git` (you do not need to
    add every file by itself as you can directly add the `data/` folder). Doing this should create a human-readable
    file with the extension `.dvc`. This is the *metafile* explained earlier that will serve as a placeholder for
    your data. If you are on Windows and this step fails you may need to install `pywin32`. At the same time, the
    `data` folder should have been added to the `.gitignore` file that marks which files should not be tracked by
    git. Confirm that this is correct.

7. Now we are going to add, commit and tag the *metafiles* so we can restore to this stage later on. Committing and
    tagging the files should look something like this:

    ```bash
    git add data.dvc .gitignore
    git commit -m "First version of data"
    git tag -a v1.0 -m "data v1.0"
    ```
8. Finally, push your data to the remote storage using `dvc push`. You will be asked to authenticate, which involves
    copy-pasting the code in the link prompted. Check out your Google Drive folder. You will see that the data is not
    in a recognizable format anymore due to the way that `dvc` packs and tracks the data. The boring detail is that
    `dvc` converts the data into
    [content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage),
    which makes data much faster to get. Finally, make sure that your data is not stored in your GitHub repository.

    After authenticating the first time, `dvc` should be set up without having to authenticate again. If you for some
    reason encounter that `dvc` fails to authenticate, you can try to reset the authentication. Locate the file
    `$CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json`, where `$CACHE_HOME` depends on your operating system:
    ```bash
    git clone <your-repository-url>
    dvc pull
    ```

    (assuming that you give them access rights to the folder in your drive). Try doing this (in some other location
    than your standard code) to make sure that the two commands indeed download both your code and data.

10. Let's look at the process of updating our data. Remember, the important aspect of version control is that we do
    not need to store explicit files called `data_v1.pt`, `data_v2.pt`, etc. but can just have a single `data.pt`
    where we can always check out earlier versions.

11. Redo the above steps, adding the new data using `dvc`, committing and tagging the metafiles, e.g. the following
    commands should be executed (with appropriate input):

    `dvc add -> git add -> git commit -> git tag -> dvc push -> git push`.
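    Concretely, the chain could look like this sketch (the tag name and commit message are illustrative, assuming the
    updated data lives in `data/`):

    ```bash
    dvc add data/
    git add data.dvc
    git commit -m "Updated data"   # illustrative message
    git tag -a v2.0 -m "data v2.0" # illustrative tag
    dvc push
    git push
    ```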

12. Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done
    correctly, you should be able to run:

    ```bash
    git checkout v1.0
    dvc checkout
    ```

    Confirm that you have reverted to the original data.

13. (Optional) Finally, it is important to note that `dvc` is not only intended to be used to store data files but
    also any other large files such as trained model weights (with billions of parameters these can be quite large).
    For example, if we always store our best-performing model in a file called `best_model.ckpt` then we can use
    `dvc` to version control it, store it online and make it easy for others to download. Feel free to experiment
    with this using your own model checkpoints.
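    As a sketch, versioning a checkpoint works exactly like versioning data (the path and commit message are
    illustrative; `dvc add` creates the metafile and updates `.gitignore` for you):

    ```bash
    dvc add best_model.ckpt
    git add best_model.ckpt.dvc .gitignore
    git commit -m "Track best model with dvc"  # illustrative message
    dvc push
    ```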

## 🧠 Knowledge check

That's all for today. With the combined power of `git` and `dvc` we should be able to version control everything in
our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that `dvc`
offers much more than just data version control, so if you want to deep dive into `dvc` we recommend their
[pipeline](https://dvc.org/doc/user-guide/project-structure/pipelines-files) feature and how this can be used to set up
version-controlled [experiments](https://dvc.org/doc/command-reference/exp). Note that we are going to revisit `dvc`
later for a more permanent (and large-scale) storage solution.