!!! info "Core Module"

In this module, we are going to return to version control. However, this time we are going to focus on version control
of data. The reason we need to separate between standard version control and data version control comes down to one
problem: size.

Classic version control was developed to keep track of code files, which are all simple text files. Even a codebase that
contains 1000+ files with millions of lines of code can probably be stored in less than a single gigabyte (GB). On the
other hand, the size of data can be drastically bigger. As most machine learning algorithms only get better the more
data you feed them, we are seeing models today that are being trained on petabytes of data (1,000,000 GB).

Because this is an important concept, there exist a couple of frameworks that specialize in versioning data, such as
[DVC](https://dvc.org/), [DAGsHub](https://dagshub.com/), [Hub](https://www.activeloop.ai/),
[Modelstore](https://modelstore.readthedocs.io/en/latest/) and [ModelDB](https://github.com/VertaAI/modeldb/).
Regardless of the framework, they all implement roughly the same concept: instead of storing the actual data files,
or in general any large *artifact* files, we store a pointer to these large files. We then version
control the pointer instead of the artifact.

In this course we are going to use `DVC` provided by [iterative.ai](https://iterative.ai/) as they also provide tools
for automating machine learning, which we are going to focus on later.
## DVC: What is it?
DVC (Data Version Control) is simply an extension of `git` that versions not only data but also models and
experiments in general. But how does it deal with these large data files? Essentially, `DVC` will just keep track of a
small *metafile* that will then point to some remote location where your original data is stored. Metafiles
essentially work as placeholders for your data files. Your large data files are then stored in some remote location such
as Google Drive or an `S3` bucket from Amazon.
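To make this concrete, a metafile such as `data.dvc` is just a small, human-readable YAML file; the hash, sizes and path below are illustrative placeholders, not real values:

```yaml
# Illustrative contents of a metafile like data.dvc
# (hash, size and nfiles are placeholders).
outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6.dir
  size: 123456789
  nfiles: 42
  path: data
```

This tiny file is all that `git` needs to track; the actual content is looked up in the remote storage by its hash.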

<figure markdown>
![Image](../figures/dvc.png){ width="700" }
</figure>
As the figure shows, we now have two remote locations: one for code and one for data. We use `git pull/push` for the
code and `dvc pull/push` for the data. The key concept is the connection between the data file `model.pkl`, which is
fairly large, and its respective *metafile* `model.pkl.dvc`, which is very small. The large file is stored in the data
remote and the metafile is stored in the code remote.
## ❔ Exercises

If in doubt about some of the exercises, we recommend checking out the [documentation for DVC](https://dvc.org/doc) as
it contains excellent tutorials.

1. For these exercises, we are going to use [Google Drive](https://www.google.com/intl/da/drive/) as a remote storage
    solution for our data. If you do not already have a Google account, please create one (we are going to use it again
    in later exercises). Please make sure that you have at least 1GB of free space.

2. Next, install DVC and the Google Drive extension:

    ```bash
    pip install dvc
    pip install "dvc[gdrive]"
    ```
3. Run `dvc init` inside your repository; this will set up `dvc` for this repository (similar to how `git init` will
    initialize a git repository). These files should be committed using standard `git` to your repository.
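    Initializing and committing can be sketched as follows (the commit message is illustrative; `dvc init` creates the
    `.dvc/` folder and a `.dvcignore` file):

    ```bash
    dvc init
    git add .dvc .dvcignore
    git commit -m "Initialize dvc"  # illustrative message
    ```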

4. Go to your Google Drive and create a new folder called `dtu_mlops_data`. Then copy the unique identifier
    belonging to that folder from the folder's URL and add it as remote storage:
    ```bash
    dvc remote add -d storage gdrive://<your_identifier>
    ```

5. Check the content of the file `.dvc/config`. Does it contain a pointer to your remote storage? Afterwards, make sure
    to add this file to the next commit we are going to make:

    ```bash
    git add .dvc/config
    ```
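    After adding the remote, the config should look roughly like the following (the folder identifier here is an
    illustrative placeholder, not a real value):

    ```ini
    ; Illustrative .dvc/config after `dvc remote add -d storage gdrive://<your_identifier>`
    [core]
        remote = storage
    ['remote "storage"']
        url = gdrive://1A2b3C4d5E6f7G8h9I0j
    ```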
6. Call the `dvc add` command on your data files exactly like you would add a file with `git` (you do not need to
    add every file by itself as you can directly add the `data/` folder). Doing this should create a human-readable
    file with the extension `.dvc`. This is the *metafile* explained earlier that will serve as a placeholder for
    your data. If you are on Windows and this step fails you may need to install `pywin32`. At the same time, the
    `data` folder should have been added to the `.gitignore` file that marks which files should not be tracked by
    git. Confirm that this is correct.

7. Now we are going to add, commit and tag the *metafiles* so we can restore to this stage later on. Committing and
    tagging the files should look something like this:

    ```bash
    git add data.dvc .gitignore
    git commit -m "First version of data"
    git tag -a v1.0 -m "data v1.0"
    ```
8. Finally, push your data to the remote storage using `dvc push`. You will be asked to authenticate, which involves
    copy-pasting the code in the link prompted. Check out your Google Drive folder. You will see that the data is not
    in a recognizable format anymore due to the way that `dvc` packs and tracks the data. The boring detail is that
    `dvc` converts the data into
    [content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage),
    which makes data much faster to get. Finally, make sure that your data is not stored in your GitHub repository.

    After authenticating the first time, `dvc` should be set up without having to authenticate again. If you for some
    reason encounter that `dvc` fails to authenticate, you can try to reset the authentication. Locate the file
    `$CACHE_HOME/pydrive2fs/{gdrive_client_id}/default.json`, where `$CACHE_HOME` depends on your operating system:
    ```bash
    git clone <your-repository-url>
    dvc pull
    ```

    (assuming that you give them access rights to the folder in your drive). Try doing this (in some other location
    than your standard code) to make sure that the two commands indeed download both your code and data.

10. Let's look at the process of updating our data. Remember, the important aspect of version control is that we do
    not need to store explicit files called `data_v1.pt`, `data_v2.pt`, etc. but can just have a single `data.pt`
    where we can always check out earlier versions.

11. Redo the above steps, adding the new data using `dvc`, committing and tagging the metafiles, e.g. the following
    commands should be executed (with appropriate input):

    `dvc add -> git add -> git commit -> git tag -> dvc push -> git push`.
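    Concretely, the chain could look like this sketch (the tag name and commit message are illustrative, assuming the
    updated data lives in `data/`):

    ```bash
    dvc add data/
    git add data.dvc
    git commit -m "Updated data"   # illustrative message
    git tag -a v2.0 -m "data v2.0" # illustrative tag
    dvc push
    git push
    ```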

12. Let's say that you wanted to go back to the state of your data in v1.0. If the above steps have been done
    correctly, you should be able to run:

    ```bash
    git checkout v1.0
    dvc checkout
    ```

    Confirm that you have reverted to the original data.

13. (Optional) Finally, it is important to note that `dvc` is not only intended to be used to store data files but
    also any other large files such as trained model weights (with billions of parameters these can be quite large).
    For example, if we always store our best-performing model in a file called `best_model.ckpt` then we can use
    `dvc` to version control it, store it online and make it easy for others to download. Feel free to experiment
    with this using your own model checkpoints.
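    As a sketch, versioning a checkpoint works exactly like versioning data (the path and commit message are
    illustrative; `dvc add` creates the metafile and updates `.gitignore` for you):

    ```bash
    dvc add best_model.ckpt
    git add best_model.ckpt.dvc .gitignore
    git commit -m "Track best model with dvc"  # illustrative message
    dvc push
    ```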

## 🧠 Knowledge check

That's all for today. With the combined power of `git` and `dvc` we should be able to version control everything in
our development pipeline such that no changes are lost (assuming we commit regularly). It should be noted that `dvc`
offers much more than just data version control, so if you want to deep dive into `dvc` we recommend their
[pipeline](https://dvc.org/doc/user-guide/project-structure/pipelines-files) feature and how this can be used to set up
version-controlled [experiments](https://dvc.org/doc/command-reference/exp). Note that we are going to revisit `dvc`
later for a more permanent (and large-scale) storage solution.