Commit ea39f30

fix indention
1 parent 8d00242 commit ea39f30

File tree: 2 files changed (+70, −69 lines)


s2_organisation_and_version_control/dvc.md

Lines changed: 35 additions & 34 deletions
````diff
@@ -6,17 +6,17 @@

 !!! info "Core Module"

-In this module we are going to return to version control. However, this time we are going to focus on version control
-of data. The reason we need to separate between standandard version control and data version control comes down to one
+In this module, we are going to return to version control. However, this time we are going to focus on version control
+of data. The reason we need to separate between standard version control and data version control comes down to one
 problem: size.

 Classic version control was developed to keep track of code files, which all are simple text files. Even a codebase that
-contains 1000+ files with million lines of codes can probably be stored in less than a single gigabyte (GB). On the
-other hand, the size of data can be drastically bigger. As most machine learning algorithms only gets better with the
+contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the
+other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better with the
 more data that you feed them, we are seeing models today that are being trained on petabytes of data (1.000.000 GB).

-Because this is a important concept there exist a couple of frameworks that have specialized in versioning data such as
-[dvc](https://dvc.org/), [DAGsHub](https://dagshub.com/), [Hub](https://www.activeloop.ai/),
+Because this is an important concept there exist a couple of frameworks that have specialized in versioning data such as
+[DVC](https://dvc.org/), [DAGsHub](https://dagshub.com/), [Hub](https://www.activeloop.ai/),
 [Modelstore](https://modelstore.readthedocs.io/en/latest/) and [ModelDB](https://github.com/VertaAI/modeldb/).
 Regardless of what framework, they all implement somewhat the same concept: instead of storing the actual data files
 or in general storing any large *artifacts* files we instead store a pointer to these large flies. We then version
@@ -29,16 +29,16 @@ control the point instead of the artifact.
 </figcaption>
 </figure>

-We are in this course going to use `dvc` provided by [iterative.ai](https://iterative.ai/) as they also provide tools
+We are in this course going to use `DVC` provided by [iterative.ai](https://iterative.ai/) as they also provide tools
 for automatizing machine learning, which we are going to focus on later.

 ## DVC: What is it?

 DVC (Data Version Control) is simply an extension of `git` to not only take versioning data but also models and
-experiments in general. But how does it deal with these large data files? Essentially, `dvc` will just keep track of a
-small *metafile* that will then point to some remote location where you original data is store. *metafiles* essentially
-works as placeholders for your datafiles. Your large datafiles are then stored in some remote location such as Google
-drive or an `S3` bucket from Amazon.
+experiments in general. But how does it deal with these large data files? Essentially, `DVC` will just keep track of a
+small *metafile* that will then point to some remote location where your original data is stored. Metafiles
+essentially work as placeholders for your data files. Your large data files are then stored in some remote location such
+as Google Drive or an `S3` bucket from Amazon.

 <figure markdown>
 ![Image](../figures/dvc.png){ width="700" }
@@ -48,20 +48,20 @@ drive or an `S3` bucket from Amazon.
 </figure>

 As the figure shows, we now have two remote locations: one for code and one for data. We use `git pull/push` for the
-code and `dvc pull/push` for the data. The key concept is the connection between the data file `model.pkl` that is
-fairly large and its respective *metafile* `model.pkl.dvc` that is very small. The large file is stored in the data
-remote and the metafile is stored in code remote.
+code and `dvc pull/push` for the data. The key concept is the connection between the data file `model.pkl` which is
+fairly large and its respective *metafile* `model.pkl.dvc` which is very small. The large file is stored in the data
+remote and the metafile is stored in the code remote.

 ## ❔ Exercises

-If in doubt about some of the exercises, we recommend checking out the [documentation for dvc](https://dvc.org/doc) as
+If in doubt about some of the exercises, we recommend checking out the [documentation for DVC](https://dvc.org/doc) as
 it contains excellent tutorials.

-1. For these exercises we are going to use [Google drive](https://www.google.com/intl/da/drive/) as remote storage
+1. For these exercises, we are going to use Google [drive](https://www.google.com/intl/da/drive/) as a remote storage
 solution for our data. If you do not already have a Google account, please create one (we are going to use it again
 in later exercises). Please make sure that you at least have 1GB of free space.

-2. Next, install dvc and the Google drive extension
+2. Next, install DVC and the Google Drive extension

 ```bash
 pip install dvc
@@ -90,7 +90,7 @@ it contains excellent tutorials.
 this will setup `dvc` for this repository (similar to how `git init` will initialize a git repository).
 These files should be committed using standard `git` to your repository.

-4. Go to your Google drive and create a new folder called `dtu_mlops_data`. Then copy the unique identifier
+4. Go to your Google Drive and create a new folder called `dtu_mlops_data`. Then copy the unique identifier
 belonging to that folder as shown in the figure below

 <figure markdown>
@@ -103,7 +103,7 @@ it contains excellent tutorials.
 dvc remote add -d storage gdrive://<your_identifier>
 ```

-5. Check the content of the file `.dvc/config`. Does it contain a pointer to your remote storage? Afterwards make sure
+5. Check the content of the file `.dvc/config`. Does it contain a pointer to your remote storage? Afterwards, make sure
 to add this file to the next commit we are going to make:

 ```bash
@@ -112,13 +112,13 @@ it contains excellent tutorials.

 6. Call the `dvc add` command on your data files exactly like you would add a file with `git` (you do not need to
 add every file by itself as you can directly add the `data/` folder). Doing this should create a human-readable
-file with the extension `.dvc`. This is the *metafile* as explained earlier that will serve as a placeholder for
-your data. If you are on Windows and this step fail you may need to install `pywin32`. At the same time the `data/`
+file with the extension `.dvc`. This is the *metafile* as explained earlier that will serve as a placeholder for
+your data. If you are on Windows and this step fails you may need to install `pywin32`. At the same time, the `data`
 folder should have been added to the `.gitignore` file that marks which files should not be tracked by git. Confirm
 that this is correct.

 7. Now we are going to add, commit and tag the *metafiles* so we can restore to this stage later on. Commit and tag
-the files, should look something like this:
+the files, which should look something like this:

 ```bash
 git add data.dvc .gitignore
@@ -127,12 +127,12 @@ it contains excellent tutorials.
 ```

 8. Finally, push your data to the remote storage using `dvc push`. You will be asked to authenticate, which involves
-copy-pasting the code in the link prompted. Checkout your Google drive folder. You will see that the data is not
-in a recognizable format anymore due to the way that `dvc` packs and tracks the data. The boring details is that
+copy-pasting the code in the link prompted. Check out your Google Drive folder. You will see that the data is not
+in a recognizable format anymore due to the way that `dvc` packs and tracks the data. The boring detail is that
 `dvc` converts the data into [content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
-which makes data much faster to get. Finally, make sure that your data is not stored in your github repository.
+which makes data much faster to get. Finally, make sure that your data is not stored in your Github repository.

-After authenticating the first time, dvc should be setup without having to authenticate again. If you for some
+After authenticating the first time, `DVC` should be setup without having to authenticate again. If you for some
 reason encounter that dvc fails to authenticate, you can try to reset the authentication. Locate the file
 `$CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json` where `$CACHE_HOME` depends on your operating system:

@@ -158,7 +158,7 @@ it contains excellent tutorials.
 ```

 (assuming that you give them access right to the folder in your drive). Try doing this (in some other location
-than your standard code) to make sure that the two commands indeed downloads both your code and data.
+than your standard code) to make sure that the two commands indeed download both your code and data.

 10. Lets look about the process of updating our data. Remember the important aspect of version control is that we do not
 need to store explicit files called `data_v1.pt`, `data_v2.pt` etc. but just have a single `data.pt` that where we
@@ -168,6 +168,7 @@ it contains excellent tutorials.

 11. Redo the above steps, adding the new data using `dvc`, committing and tagging the metafiles e.g. the following
 commands should be executed (with appropriate input):
+
 `dvc add -> git add -> git commit -> git tag -> dvc push -> git push`.

 12. Lets say that you wanted to go back to the state of your data in v1.0. If the above steps have been done correctly,
@@ -178,13 +179,13 @@ it contains excellent tutorials.
 dvc checkout
 ```

-confirm that you have reverted back to the original data.
+confirm that you have reverted to the original data.

 13. (Optional) Finally, it is important to note that `dvc` is not only intended to be used to store data files but also
-any other large files such as trained model weights (with billion of parameters these can be quite large). For
-example if we always stored out best performing model in a file called `best_model.ckpt` then we can use `dvc` to
-version control it, store it online and make it easy for other to download. Feel free to experiment with this using
-your own model checkpoints.
+any other large files such as trained model weights (with billions of parameters these can be quite large). For
+example, if we always store our best-performing model in a file called `best_model.ckpt` then we can use `dvc` to
+version control it, store it online and make it easy for others to download. Feel free to experiment with this using
+your model checkpoints.

 ## 🧠 Knowledge check

@@ -210,7 +211,7 @@ it contains excellent tutorials.

 That's all for today. With the combined power of `git` and `dvc` we should be able to version control everything in
 our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that `dvc`
-offers such more than just data version control, so if you want to deep dive into `dvc` we recommend their
+offers more than just data version control, so if you want to deep dive into `dvc` we recommend their
 [pipeline](https://dvc.org/doc/user-guide/project-structure/pipelines-files) feature and how this can be used to setup
 version controlled [experiments](https://dvc.org/doc/command-reference/exp). Note that we are going to revisit `dvc`
-later for a more permanent (and large scale) storage solution.
+later for a more permanent (and large-scale) storage solution.
````
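The exercise text in this diff mentions that `dvc` stores pushed data as content-addressable storage. The idea can be sketched in a few lines of shell. This is a toy illustration of the concept only, not `dvc`'s actual cache layout or hash choice, and all file and directory names here are made up:

```bash
# Content-addressable storage in miniature: a file is stored under the hash
# of its contents, so identical bytes are stored once and any version can be
# fetched back given only its hash.
set -e
store=$(mktemp -d)            # stands in for the remote storage
workdir=$(mktemp -d)          # stands in for the working copy

echo "version 1 of the data" > "$workdir/data.txt"
hash=$(sha256sum "$workdir/data.txt" | cut -d' ' -f1)

# split the hash into <2 chars>/<rest> to keep any one directory small
mkdir -p "$store/${hash:0:2}"
cp "$workdir/data.txt" "$store/${hash:0:2}/${hash:2}"

# a tiny "metafile" only needs to record the hash to point at the exact bytes
echo "$hash" > "$workdir/data.txt.hash"

# restore the file from the store using nothing but the recorded hash
h=$(cat "$workdir/data.txt.hash")
restored=$(cat "$store/${h:0:2}/${h:2}")
echo "$restored"
```

Because the lookup key is derived from the contents, re-adding unchanged data is free, which is roughly why the exercise claims this layout "makes data much faster to get".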

s5_continuous_integration/auto_docker.md

Lines changed: 35 additions & 35 deletions
````diff
@@ -39,53 +39,53 @@ not store our data in Github, we cannot copy it during the build process.
 2. Start by creating a [Docker Hub account](https://hub.docker.com/)

 3. Next, within Docker Hub create an access token by going to `Settings -> Security`. Click the `New Access Token`
-button and give it a name that you recognize.
+    button and give it a name that you recognize.

 4. Copy the newly created access token and head over to your Github repository online. Go to
-`Settings -> Secrets -> Actions` and click the `New repository secret`. Copy over the access token and give
-it the name `DOCKER_HUB_TOKEN`. Additionally, add two other secrets `DOCKER_HUB_USERNAME` and `DOCKER_HUB_REPOSITORY`
-that contains your docker username and docker repository name respectively.
+    `Settings -> Secrets -> Actions` and click the `New repository secret`. Copy over the access token and give
+    it the name `DOCKER_HUB_TOKEN`. Additionally, add two other secrets `DOCKER_HUB_USERNAME` and `DOCKER_HUB_REPOSITORY`
+    that contains your docker username and docker repository name respectively.

 5. Next we are going to construct the actual Github actions workflow file:

-```yaml
-name: Docker Image CI
-
-on:
-  push:
-    branches: [ master ]
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-    - uses: actions/checkout@v2
-    - name: Build the Docker image
-      run: |
-        echo "${{ secrets.DOCKER_HUB_TOKEN }}" | docker login \
-          -u "${{ secrets.DOCKER_HUB_USERNAME }}" --password-stdin docker.io
-        docker build . --file Dockerfile \
-          --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA
-        docker push docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA
-```
-
-The first part of the workflow file should look somewhat recognizable. However, the last three lines are where
-all the magic happens. Carefully go through them and figure out what they do. If you want some help you can looking
-at the help page for `docker login`, `docker build` and `docker push`.
+    ```yaml
+    name: Docker Image CI
+
+    on:
+      push:
+        branches: [ master ]
+
+    jobs:
+      build:
+        runs-on: ubuntu-latest
+        steps:
+        - uses: actions/checkout@v2
+        - name: Build the Docker image
+          run: |
+            echo "${{ secrets.DOCKER_HUB_TOKEN }}" | docker login \
+              -u "${{ secrets.DOCKER_HUB_USERNAME }}" --password-stdin docker.io
+            docker build . --file Dockerfile \
+              --tag docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA
+            docker push docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA
+    ```
+
+    The first part of the workflow file should look somewhat recognizable. However, the last three lines are where
+    all the magic happens. Carefully go through them and figure out what they do. If you want some help you can looking
+    at the help page for `docker login`, `docker build` and `docker push`.

 6. Upload the workflow to your github repository and check that it is being executed. If everything you should be able
-to see the the build docker image in your container repository in docker hub.
+    to see the the build docker image in your container repository in docker hub.

 7. Make sure that you can execute `docker pull` locally to pull down the image that you just continuously build

 8. (Optional) To test that the container works directly in github you can also try to include an additional
-step that actually runs the container.
+    step that actually runs the container.

-```yaml
-- name: Run container
-  run: |
-    docker run ...
-```
+    ```yaml
+    - name: Run container
+      run: |
+        docker run ...
+    ```

 That ends the session on continues docker building. We are going to revisit this topic after introducing the basic
 concepts of working in the cloud, as it will make our life easier in the long run when we get to continues deployment
````
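The optional run step left open in the diff could be completed along the lines of the sketch below. This is a hypothetical example: the image tag simply mirrors the tag built earlier in the same workflow, and any flags or commands after the image name depend on what your container actually does.

```yaml
# Hypothetical follow-up step: pull and run the image the previous step just
# pushed, as a minimal smoke test that the container starts at all.
- name: Run container
  run: |
    docker run docker.io/${{ secrets.DOCKER_HUB_USERNAME }}/${{ secrets.DOCKER_HUB_REPOSITORY }}:$GITHUB_SHA
```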
