diff --git a/.devcontainer/Dockerfile b/.devcontainer/Dockerfile index e93a13523..85428200a 100644 --- a/.devcontainer/Dockerfile +++ b/.devcontainer/Dockerfile @@ -1,4 +1,4 @@ -FROM python:3.11-slim-buster +FROM python:3.11-slim RUN apt update && \ apt install --no-install-recommends -y build-essential gcc && \ diff --git a/.github/workflows/repo_scraper.yml b/.github/workflows/repo_scraper.yml index e1b12aac3..92a14cf3d 100644 --- a/.github/workflows/repo_scraper.yml +++ b/.github/workflows/repo_scraper.yml @@ -1,9 +1,9 @@ name: Run repo scraper on: - #schedule: - # - cron: "0 0 * * *" # Run at the end of every day - workflow_dispatch: {} # manual executions + schedule: + - cron: "0 */6 * * *" # Runs every 6 hours + workflow_dispatch: {} # Allows manual executions jobs: scrape: diff --git a/README.md b/README.md index 44bc4dbff..bf8adabd5 100644 --- a/README.md +++ b/README.md @@ -107,7 +107,7 @@ this. Finally, you may try to cut the cost of running your model in production, and trying to optimize some steps. The focus in this course is particularly on the **Operations** part of MLOps as this is what many data scientists are -missing in their toolbox to take all the knowledge they have about data processing and model development into a +missing in their toolbox to bring all the knowledge they have about data processing and model development into a production setting. ## ❔ Learning objectives diff --git a/pages/timeplan.md b/pages/timeplan.md index 8e00f6e88..0b8f19284 100644 --- a/pages/timeplan.md +++ b/pages/timeplan.md @@ -20,8 +20,9 @@ be using in the exercises. 
Recordings (link to drive folder with mp4 files): -* [🎥2023 Lectures](https://drive.google.com/drive/folders/1j56XyHoPLjoIEmrVcV_9S1FBkXWZBK0w?usp=sharing) +* [🎥2025 Lectures](https://panopto.dtu.dk/Panopto/Pages/Sessions/List.aspx?folderID=14eeb1b7-5c39-4547-b7c3-b25d007cecd1) * [🎥2024 Lectures](https://drive.google.com/drive/folders/1mgLlvfXUT9xdg9EZusgeWAmfpUDSwfL6?usp=sharing) +* [🎥2023 Lectures](https://drive.google.com/drive/folders/1j56XyHoPLjoIEmrVcV_9S1FBkXWZBK0w?usp=sharing) ## Week 1 diff --git a/reports/README.md b/reports/README.md index a33d2712e..e3670456d 100644 --- a/reports/README.md +++ b/reports/README.md @@ -564,16 +564,17 @@ will check the repositories and the code to verify your answers. ### Question 31 > **State the individual contributions of each team member. This is required information from DTU, because we need to** -> **make sure all members contributed actively to the project** +> **make sure all members contributed actively to the project. Additionally, state if/how you have used generative AI** +> **tools in your project.** > -> Recommended answer length: 50-200 words. +> Recommended answer length: 50-300 words. > > Example: > *Student sXXXXXX was in charge of developing of setting up the initial cookie cutter project and developing of the* > *docker containers for training our applications.* > *Student sXXXXXX was in charge of training our models in the cloud and deploying them afterwards.* > *All members contributed to code by...* -> +> *We have used ChatGPT to help debug our code. 
Additionally, we used GitHub Copilot to help write some of our code.* > Answer: --- question 31 fill here --- diff --git a/reports/report.py b/reports/report.py index 3ee359b12..589550faf 100644 --- a/reports/report.py +++ b/reports/report.py @@ -156,7 +156,7 @@ def check() -> None: ] ), "question_30": LengthConstraints(min_length=200, max_length=400), - "question_31": LengthConstraints(min_length=50, max_length=200), + "question_31": LengthConstraints(min_length=50, max_length=300), } if len(answers) != 31: msg = "Number of answers are different from the expected 31. Have you changed the template?" diff --git a/requirements.txt b/requirements.txt index 1ccd5b58e..8d515f354 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,14 +2,14 @@ mkdocs-material == 9.5.49 mkdocs-glightbox == 0.4.0 mkdocs-material-extensions == 1.3.1 -pymdown-extensions == 10.13 +pymdown-extensions == 10.14 mkdocs-same-dir == 0.1.3 mkdocs-git-revision-date-localized-plugin == 1.3.0 mkdocs-exclude == 1.0.2 markdown-exec[ansi] == 1.10.0 # Developer stuff -ruff == 0.8.6 +ruff == 0.9.1 codespell == 2.3.0 pre-commit == 4.0.1 diff --git a/s1_development_environment/deep_learning_software.md b/s1_development_environment/deep_learning_software.md index eed664572..4eb09d73e 100644 --- a/s1_development_environment/deep_learning_software.md +++ b/s1_development_environment/deep_learning_software.md @@ -169,7 +169,7 @@ these two commands: ```bash pip install gdown -gdown --folder https://drive.google.com/drive/folders/1ddWeCcsfmelqxF8sOGBihY9IU98S9JRP?usp=sharing +gdown --folder 'https://drive.google.com/drive/folders/1ddWeCcsfmelqxF8sOGBihY9IU98S9JRP?usp=sharing' ``` The data should be placed in a folder subfolder called `data/corruptedmnist` in the root of the project. Your overall @@ -215,7 +215,7 @@ future as you start to add more and more features. As subgoals, please fulfill t ??? 
example "Starting point for `data.py`" - ```python linenums="1" title="model.py" + ```python linenums="1" title="data.py" --8<-- "s1_development_environment/exercise_files/final_exercise/data.py" ``` @@ -236,7 +236,7 @@ future as you start to add more and more features. As subgoals, please fulfill t We have additionally in the solution added functionality for plotting the images together with the labels for inspection. Remember: all good machine learning starts with a good understanding of the data. - ```python linenums="1" hl_lines="17 18" title="model.py" + ```python linenums="1" hl_lines="17 18" title="data.py" --8<-- "s1_development_environment/exercise_files/final_exercise/data_solution.py" ``` @@ -245,7 +245,7 @@ future as you start to add more and more features. As subgoals, please fulfill t ```bash python main.py train --lr 1e-4 - python main.py evaluate trained_model.pt + python main.py evaluate model.pth ``` which can be implemented in various ways. We provide you with a starting script that uses the `typer` library to @@ -270,8 +270,8 @@ future as you start to add more and more features. 
As subgoals, please fulfill t "version": "0.2.0", "configurations": [ { - "name": "Python: Current File", - "type": "python", + "name": "Train", + "type": "debugpy", "request": "launch", "program": "${file}", "args": [ diff --git a/s1_development_environment/exercise_files/fc_model.py b/s1_development_environment/exercise_files/fc_model.py index f1b05a0fd..68bdcb463 100644 --- a/s1_development_environment/exercise_files/fc_model.py +++ b/s1_development_environment/exercise_files/fc_model.py @@ -8,7 +8,7 @@ class Network(nn.Module): Arguments: input_size: integer, size of the input layer output_size: integer, size of the output layer - hidden_layers: list of integers, the sizes of the hidden layers + hidden_layers: list of integers (one for each hidden layer), the sizes of the hidden layers """ diff --git a/s1_development_environment/exercise_files/final_exercise/main_solution.py b/s1_development_environment/exercise_files/final_exercise/main_solution.py index 58f1810f9..463348e4c 100644 --- a/s1_development_environment/exercise_files/final_exercise/main_solution.py +++ b/s1_development_environment/exercise_files/final_exercise/main_solution.py @@ -1,8 +1,8 @@ import matplotlib.pyplot as plt import torch import typer -from data import corrupt_mnist -from model import MyAwesomeModel +from data_solution import corrupt_mnist +from model_solution import MyAwesomeModel DEVICE = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu") diff --git a/s2_organisation_and_version_control/README.md b/s2_organisation_and_version_control/README.md index 98c708e61..0b53b2120 100644 --- a/s2_organisation_and_version_control/README.md +++ b/s2_organisation_and_version_control/README.md @@ -6,7 +6,7 @@ - ![](../figures/icons/git.png){align=right : style="height:100px;width:100px"} - Learn the basics of version control and how to use `git` to track changes to your code and collaborate with others. 
+ Learn the basics of version control and how to use `git` to track changes in your code and collaborate with others. [:octicons-arrow-right-24: M5: Git](git.md) @@ -38,7 +38,7 @@ Today we take our first steps into the world of MLOps. The set of modules in this session focuses on getting organized and making sure that you are familiar with good development practices. While many of the practices you will learn about -in these modules do not seem that important when you are a single person working on a project, it is crucial when +in these modules do not seem that important when you are a single person working on a project, it becomes crucial when working in large groups that the difference in how different people organize and write their code is minimized. The topics in this session will focus on: diff --git a/s2_organisation_and_version_control/cli.md b/s2_organisation_and_version_control/cli.md index 2ea5b29a3..61e64731f 100644 --- a/s2_organisation_and_version_control/cli.md +++ b/s2_organisation_and_version_control/cli.md @@ -160,7 +160,7 @@ for doing this, and of other excellent frameworks for creating command line inte app() ``` -3. Next, lets try on a bit harder example. Below is a simple script that trains a support vector machine on the iris +3. Next, let's try a slightly harder example. Below is a simple script that trains a support vector machine on the iris dataset. !!! example "iris_classifier.py" @@ -172,8 +172,8 @@ for doing this, and of other excellent frameworks for creating command line inte Implement a CLI for the script such that the following commands can be run ```bash - python iris_classifier.py train --output 'model.ckpt' # should train the model and save it to 'model.ckpt' - python iris_classifier.py train -o 'model.ckpt' # should be the same as above + python iris_classifier.py --output 'model.ckpt' # should train the model and save it to 'model.ckpt' + python iris_classifier.py -o 'model.ckpt' # should be the same as above ``` ??? 
success "Solution" @@ -186,7 +186,7 @@ for doing this, and of other excellent frameworks for creating command line inte --8<-- "s2_organisation_and_version_control/exercise_files/typer_exercise_solution.py" ``` -4. Next lets create a CLI that has more than a single command. Continue working in the basic machine learning +4. Next, let's create a CLI that has more than a single command. Continue working in the basic machine learning application from the previous exercise, but this time we want to define two separate commands ```bash @@ -205,7 +205,7 @@ for doing this, and of other excellent frameworks for creating command line inte 5. Finally, let's try to define subcommands for our subcommands e.g. something similar to how `git` has the subcommand `remote` which in itself has multiple subcommands like `add`, `rename` etc. Continue on the simple machine - learning application from the previous exercises, but this time define a cli such that + learning application from the previous exercises, but this time define a CLI such that ```bash python iris_classifier.py train svm --kernel 'linear' @@ -222,7 +222,7 @@ for doing this, and of other excellent frameworks for creating command line inte --8<-- "s2_organisation_and_version_control/exercise_files/typer_exercise_solution3.py" ``` -6. (Optional) Let's try to combine what we have learned until now. Try to make your `typer` cli into a executable +6. (Optional) Let's try to combine what we have learned until now. Try to make your `typer` CLI into an executable script using the `pyproject.toml` file and try it out! ??? success "Solution" @@ -269,13 +269,13 @@ to interact with. Here is a example of long command that you might need to run i docker run -v $(pwd):/app -w /app --gpus all --rm -it my_image:latest python my_script.py --arg1 val1 --arg2 val2 ``` -This can be a lot to remember, and it can be easy to make mistakes. 
Instead it would be nice if we could just do +This can be a lot to remember, and it can be easy to make mistakes. Instead, it would be nice if we could just do ```bash run my_command --arg1=val1 --arg2=val2 ``` -e.g. easier to remember because we have remove a lot of the hard-to-remember stuff, but we are still able to configure +i.e. easier to remember because we have removed a lot of the hard-to-remember stuff, but we are still able to configure it to our liking. To help with this, we are going to look at the [invoke](http://www.pyinvoke.org/) package. `invoke` is a Python package that allows you to define tasks that can be run from the terminal. It is a bit like a more advanced version of the [Makefile](https://makefiletutorial.com/) that @@ -324,7 +324,7 @@ easier. invoke python ``` -4. Lets try to create a task that simplifies the process of `git add`, `git commit`, `git push`. Create a task such +4. Let's try to create a task that simplifies the process of `git add`, `git commit`, `git push`. Create a task such that the following command can be run ```bash diff --git a/s2_organisation_and_version_control/code_structure.md b/s2_organisation_and_version_control/code_structure.md index 82dcd4346..b6cd7e359 100644 --- a/s2_organisation_and_version_control/code_structure.md +++ b/s2_organisation_and_version_control/code_structure.md @@ -263,7 +263,7 @@ your head around where files are located. ??? success "Solution" - ```python linenums="1" title="make_dataset.py" + ```python linenums="1" title="data.py" --8<-- "s2_organisation_and_version_control/exercise_files/data_solution.py" ``` @@ -273,9 +273,14 @@ your head around where files are located. project. It is similar to `Markefile`s if you are familiar with them. 
Try out some of the pre-defined tasks: ```bash + # first install invoke + pip install invoke + # then you can execute the tasks invoke preprocess-data # runs the data.py file invoke requirements # installs all requirements in the requirements.txt file invoke train # runs the train.py file + # or get a list of all tasks + invoke --list ``` In general, we recommend that you add commands to the `tasks.py` file as you move along in the course. @@ -292,7 +297,7 @@ your head around where files are located. This is the CNN solution from yesterday and it may differ from the model architecture you have created. - ```python linenums="1" title="make_dataset.py" + ```python linenums="1" title="model.py" --8<-- "s2_organisation_and_version_control/exercise_files/model_solution.py" ``` @@ -304,7 +309,7 @@ your head around where files are located. ??? success "Solution" - ```python linenums="1" title="make_dataset.py" + ```python linenums="1" title="train.py" --8<-- "s2_organisation_and_version_control/exercise_files/train_solution.py" ``` 8. Transfer the remaining parts of the `main.py` script into the `src//evaluate.py` script e.g. the parts @@ -313,7 +318,7 @@ your head around where files are located. ??? success "Solution" - ```python linenums="1" title="make_dataset.py" + ```python linenums="1" title="evaluate.py" --8<-- "s2_organisation_and_version_control/exercise_files/evaluate_solution.py" ``` diff --git a/s2_organisation_and_version_control/dvc.md b/s2_organisation_and_version_control/dvc.md index 7ed8864ae..647feb3bd 100644 --- a/s2_organisation_and_version_control/dvc.md +++ b/s2_organisation_and_version_control/dvc.md @@ -182,7 +182,7 @@ it contains excellent tutorials. 
```bash pip install gdown - gdown --folder https://drive.google.com/drive/folders/1JTjbom7IrB41Chx6uxLCN16ZwIxHHVw1?usp=sharing + gdown --folder 'https://drive.google.com/drive/folders/1JTjbom7IrB41Chx6uxLCN16ZwIxHHVw1?usp=sharing' ``` Copy the data to your `data/raw` folder and then rerun your data pipeline to incorporate the new data into the diff --git a/s2_organisation_and_version_control/exercise_files/visualize_solution.py b/s2_organisation_and_version_control/exercise_files/visualize_solution.py index 4baf40db3..e6c992a55 100644 --- a/s2_organisation_and_version_control/exercise_files/visualize_solution.py +++ b/s2_organisation_and_version_control/exercise_files/visualize_solution.py @@ -8,7 +8,8 @@ def visualize(model_checkpoint: str, figure_name: str = "embeddings.png") -> None: """Visualize model predictions.""" - model = MyAwesomeModel().load_state_dict(torch.load(model_checkpoint)) + model: torch.nn.Module = MyAwesomeModel() + model.load_state_dict(torch.load(model_checkpoint)) model.eval() model.fc = torch.nn.Identity() diff --git a/s3_reproducibility/config_files.md b/s3_reproducibility/config_files.md index cfc40a3a7..0d9ddd52c 100644 --- a/s3_reproducibility/config_files.md +++ b/s3_reproducibility/config_files.md @@ -197,7 +197,7 @@ look online for your answers before looking at the solution. Remember: its not a |--my_app.py ``` -12. Finally, a awesome feature of hydra is the +12. Finally, an awesome feature of hydra is the [instantiate](https://hydra.cc/docs/advanced/instantiate_objects/overview/) feature. This allows you to define a configuration file that can be used to directly instantiating objects in python. Try to create a configuration file that can be used to instantiating the `Adam` optimizer in the `vae_mnist.py` script. @@ -223,7 +223,10 @@ look online for your answers before looking at the solution. Remember: its not a @hydra.main(config_name="adam.yaml") def main(cfg): - optimizer = hydra.utils.instantiate(cfg.optimizer) + model = ... 
# define the model we want to optimize + # the first argument of any optimizer is the parameters to optimize + # we add those dynamically when we instantiate the optimizer + optimizer = hydra.utils.instantiate(cfg.optimizer, params=model.parameters()) print(optimizer) if __name__ == "__main__": diff --git a/s3_reproducibility/docker.md b/s3_reproducibility/docker.md index cd3615c6f..bf7cd4057 100644 --- a/s3_reproducibility/docker.md +++ b/s3_reproducibility/docker.md @@ -167,7 +167,7 @@ beneficial for you to download. which will automatically remove the container after it has finished running. 9. Let's now move on to trying to construct a Dockerfile ourselves for our MNIST project. Create a file called - `trainer.dockerfile`. The intention is that we want to develop one Dockerfile for running our training script and + `train.dockerfile`. The intention is that we want to develop one Dockerfile for running our training script and one for doing predictions. 10. Instead of starting from scratch, we nearly always want to start from some base image. For this exercise, we are @@ -175,7 +175,7 @@ beneficial for you to download. ```docker # Base image - FROM python:3.9-slim + FROM python:3.11-slim ``` 11. Next, we are going to install some essentials in our image. The essentials more or less consist of a Python @@ -196,7 +196,7 @@ beneficial for you to download. ```docker COPY requirements.txt requirements.txt COPY pyproject.toml pyproject.toml - COPY / / + COPY src/ src/ COPY data/ data/ ``` @@ -226,7 +226,7 @@ beneficial for you to download. the application that we want to run when the image is being executed: ```docker - ENTRYPOINT ["python", "-u", "/train_model.py"] + ENTRYPOINT ["python", "-u", "src//train.py"] ``` The `"u"` here makes sure that any output from our script, e.g., any `print(...)` statements, gets redirected to @@ -235,7 +235,7 @@ beneficial for you to download. 13. We are now ready to build our Dockerfile into a Docker image. 
```bash - docker build -f trainer.dockerfile . -t trainer:latest + docker build -f train.dockerfile . -t train:latest ``` ??? warning "MAC M1/M2 users" @@ -247,22 +247,22 @@ beneficial for you to download. to build for by adding the `--platform` argument to the `docker build` command: ```bash - docker build --platform linux/amd64 -f trainer.dockerfile . -t trainer:latest + docker build --platform linux/amd64 -f train.dockerfile . -t train:latest ``` and also when running the image: ```bash - docker run --platform linux/amd64 trainer:latest + docker run --platform linux/amd64 train:latest ``` Note that this will significantly increase the build and run time of your Docker image when running locally, because Docker will need to emulate the other platform. In general, for the exercises today, you should not need to specify the platform, but be aware of this if you are building Docker images on your own. - Please note that here we are providing two extra arguments to `docker build`. The `-f trainer.dockerfile .` (the dot + Please note that here we are providing two extra arguments to `docker build`. The `-f train.dockerfile .` (the dot is important to remember) indicates which Dockerfile we want to run (except if you named it just `Dockerfile`) and - the `-t trainer:latest` is the respective name and tag that we see afterward when running `docker images` (see + the `-t train:latest` is the respective name and tag that we see afterward when running `docker images` (see image below). Please note that building a Docker image can take a couple of minutes.
@@ -284,7 +284,7 @@ beneficial for you to download. then try running the docker image ```bash - docker run --name experiment1 trainer:latest + docker run --name experiment1 train:latest ``` you should hopefully see your training starting. Please note that we can start as many containers as we want at @@ -296,7 +296,7 @@ beneficial for you to download. in your Dockerfile that installs your requirements with: ```bash - RUN --mount=type=cache,target=~/pip/.cache pip install -r requirements.txt --no-cache-dir + RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt ``` which mounts your local pip cache to the Docker image. For building the image, you need to have enabled the @@ -340,7 +340,7 @@ beneficial for you to download. container as ```bash - docker run --name {container_name} -v %cd%/models:/models/ trainer:latest + docker run --name {container_name} -v %cd%/models:/models/ train:latest ``` this command mounts our local `models` folder as a corresponding `models` folder in the container. Any file save @@ -351,19 +351,19 @@ beneficial for you to download. for help. 17. With training done we also need to write an application for prediction. Create a new docker image called - `predict.dockerfile`. This file should call your `/models/predict_model.py` script instead. This image + `evaluate.dockerfile`. This file should call your `src//evaluate.py` script instead. This image will need some trained model weights to work. Feel free to either include these during the build process or mount them afterwards. When you create the file try to `build` and `run` it to confirm that it works. 
Hint: if - you are passing in the model checkpoint and prediction data as arguments to your script, your `docker run` probably + you are passing in the model checkpoint and evaluation data as arguments to your script, your `docker run` probably needs to look something like ```bash - docker run --name predict --rm \ + docker run --name evaluate --rm \ -v %cd%/trained_model.pt:/models/trained_model.pt \ # mount trained model file - -v %cd%/data/example_images.npy:/example_images.npy \ # mount data we want to predict on - predict:latest \ + -v %cd%/data/test_images.pt:/test_images.pt \ # mount data we want to evaluate on + -v %cd%/data/test_targets.pt:/test_targets.pt \ + evaluate:latest \ ../../models/trained_model.pt \ # argument to script, path relative to script location in container - ../../example_images.npy ``` 18. (Optional, requires GPU support) By default, a virtual machine created by docker only has access to your `cpu` and @@ -437,7 +437,7 @@ beneficial for you to download. also fairly easy as we just need to change our `FROM` statement at the beginning of our docker file: ```docker - FROM python:3.7-slim + FROM python:3.11-slim ``` change to @@ -468,7 +468,7 @@ beneficial for you to download. 
barebones for now, so let's just define a base installation of Python: ```docker - FROM python:3.11-slim-buster + FROM python:3.11-slim RUN apt update && \ apt install --no-install-recommends -y build-essential gcc && \ diff --git a/s3_reproducibility/exercise_files/vae_mnist.py b/s3_reproducibility/exercise_files/vae_mnist.py index 5d541b007..6023d8312 100644 --- a/s3_reproducibility/exercise_files/vae_mnist.py +++ b/s3_reproducibility/exercise_files/vae_mnist.py @@ -34,7 +34,7 @@ encoder = Encoder(input_dim=x_dim, hidden_dim=hidden_dim, latent_dim=20) decoder = Decoder(latent_dim=20, hidden_dim=hidden_dim, output_dim=x_dim) -model = Model(Encoder=encoder, Decoder=decoder).to(DEVICE) +model = Model(encoder=encoder, decoder=decoder).to(DEVICE) def loss_function(x, x_hat, mean, log_var): diff --git a/s4_debugging_and_logging/debugging.md b/s4_debugging_and_logging/debugging.md index 487b590ba..814385182 100644 --- a/s4_debugging_and_logging/debugging.md +++ b/s4_debugging_and_logging/debugging.md @@ -63,3 +63,114 @@ looking at the script). Successfully debugging and running the script should pro Again, we cannot stress enough that the exercise is actually not about finding the bugs but **using a proper** debugger to find them. + +??? success "Solution for device bug" + + If you look at the reparametrization function in the `Encoder` class you can see that we initialize a noise tensor + + ```python + def reparameterization(self, mean, var): + """Reparameterization trick to sample z values.""" + epsilon = torch.randn(*var.shape) + return mean + var * epsilon + ``` + + this will fail with a + `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!` if + you are running on GPU, because the noise tensor is initialized on the CPU. 
You can fix this by initializing the + noise tensor on the same device as the mean and var tensors + + ```python + def reparameterization(self, mean, var): + """Reparameterization trick to sample z values.""" + epsilon = torch.randn(*var.shape, device=mean.device) + return mean + var * epsilon + ``` + +??? success "Solution for shape bug" + + In the `Decoder` class we initialize the following fully connected layers + + ```python + self.FC_hidden = nn.Linear(latent_dim, hidden_dim) + self.FC_output = nn.Linear(latent_dim, output_dim) + ``` + + which are used in the forward pass as + + ```python + def forward(self, x): + """Forward pass of the decoder module.""" + h = torch.relu(self.FC_hidden(x)) + return torch.sigmoid(self.FC_output(h)) + ``` + + this means that `h` is a tensor of shape `[bs, hidden_dim]`, but since we initialize the `FC_output` + layer with `latent_dim` input dimensions, the forward pass will fail with a + `RuntimeError: mat1 and mat2 shapes cannot be multiplied (bs x hidden_dim and latent_dim x output_dim)` if `hidden_dim != latent_dim`. You can + fix this by initializing the `FC_output` layer with `hidden_dim` input dimensions + + ```python + self.FC_output = nn.Linear(hidden_dim, output_dim) + ``` + +??? success "Solution for math bug" + + In the Encoder class you have the following code + + ```python + def forward(self, x): + """Forward pass of the encoder module.""" + h_ = torch.relu(self.FC_input(x)) + mean = self.FC_mean(h_) + log_var = self.FC_var(h_) + z = self.reparameterization(mean, log_var) + return z, mean, log_var + + def reparameterization(self, mean, var): + """Reparameterization trick to sample z values.""" + epsilon = torch.randn(*var.shape) + return mean + var * epsilon + ``` + + from just the naming of the variables you can see that `log_var` is the log of the variance and not the variance + itself. 
This means that you should exponentiate `log_var` before using it in the `reparameterization` function + + ```python + z = self.reparameterization(mean, torch.exp(log_var)) + ``` + + strictly speaking, the reparameterization trick samples `z = mean + std * epsilon`, so the more correct fix is to use the standard deviation instead of the variance + + ```python + z = self.reparameterization(mean, torch.exp(0.5 * log_var)) + ``` + + and, inside `reparameterization` (with its `var` argument renamed to `std`) + + ```python + epsilon = torch.randn_like(std) + ``` + +??? success "Solution for training bug" + + Any training loop in PyTorch should have the following structure + + ```python + for epoch in range(num_epochs): + for batch in dataloader: + optimizer.zero_grad() + loss = model(batch) + loss.backward() + optimizer.step() + ``` + + if you look at the code for the training loop in the `vae_mnist_bugs.py` script you can see that the gradients are + never zeroed before the backward pass. This means that they accumulate over the batches and can + explode. You can fix this by adding the line + + ```python + optimizer.zero_grad() + ``` + + as the first line of the inner training loop diff --git a/s4_debugging_and_logging/exercise_files/vae_mnist_working.py b/s4_debugging_and_logging/exercise_files/vae_mnist_working.py index e1cfe51d2..590e349c6 100644 --- a/s4_debugging_and_logging/exercise_files/vae_mnist_working.py +++ b/s4_debugging_and_logging/exercise_files/vae_mnist_working.py @@ -55,14 +55,9 @@ def forward(self, x): return z, mean, log_var - def reparameterization( - self, - mean, - std, - ): + def reparameterization(self, mean, std): """Reparameterization trick.""" epsilon = torch.randn_like(std) - return mean + std * epsilon diff --git a/s4_debugging_and_logging/logging.md b/s4_debugging_and_logging/logging.md index b76fc0736..8447604a5 100644 --- a/s4_debugging_and_logging/logging.md +++ b/s4_debugging_and_logging/logging.md @@ -492,28 +492,35 @@ metrics. This allows for better iteration of models and training procedure. 9. 
In the future it will be important for us to be able to run Wandb inside a docker container (together with whatever training or inference we specify). The problem here is that we cannot authenticate Wandb in the same way as the - previous exercise, it needs to happen automatically. Lets therefore look into how we can do that. + previous exercise, it needs to happen automatically. Let's therefore look into how we can do that. 1. First we need to generate an authentication key, or more precise an API key. This is in general the way any service (like a docker container) can authenticate. Start by going , click your profile - icon in the upper right corner and then go to settings. Scroll down to the danger zone and generate a new API - key and finally copy it. + icon in the upper right corner and then go to `User settings`. Scroll down to the danger zone and generate a + new API key (if you do not already have one) and finally copy it. 2. Next create a new docker file called `wandb.docker` and add the following code ```dockerfile - FROM python:3.10-slim + FROM python:3.11-slim RUN apt update && \ apt install --no-install-recommends -y build-essential gcc && \ apt clean && rm -rf /var/lib/apt/lists/* RUN pip install wandb - COPY s4_debugging_and_logging/exercise_files/wandb_tester.py wandb_tester.py + COPY wandb_tester.py wandb_tester.py ENTRYPOINT ["python", "-u", "wandb_tester.py"] ``` - please take a look at the script being copied into the image and afterwards build the docker image. + and a new script called `wandb_tester.py` that contains the following code - 3. When we want to run the image, what we need to do is including a environment variables that contains the API key + ```python + --8<-- "s4_debugging_and_logging/exercise_files/wandb_tester.py" + ``` + + and then build the docker image. These two files are just a very minimal setup to test that we can authenticate + a docker container with Wandb. + + 3. 
When we want to run the image, what we need to do is include an environment variable that contains the API key we generated. This will then authenticate the docker container with the wandb server: ```bash diff --git a/s4_debugging_and_logging/profiling.md b/s4_debugging_and_logging/profiling.md index 14980f91f..07f3117a8 100644 --- a/s4_debugging_and_logging/profiling.md +++ b/s4_debugging_and_logging/profiling.md @@ -16,9 +16,9 @@ At the bare minimum, the two questions a proper profiling of your program should * *“ How many times is each method in my code called?”* * *“ How long do each of these methods take?”* -The first question is important to priorities optimization. If two methods `A` and `B` have approximately the same +The first question can help us prioritize what to optimize. If two methods `A` and `B` have approximately the same runtime, but `A` is called 1000 more times than `B` we should probably spend time optimizing `A` over `B` if we want -to speedup our code. The second question is gives itself, directly telling us which methods are the expensive to call. +to speed up our code. The second question answers itself, directly telling us which methods are expensive to call. Using profilers can help you find bottlenecks in your code. In this exercise we will look at two different profilers, with the first one being the [cProfile](https://docs.python.org/3/library/profile.html). `cProfile` is @@ -31,14 +31,46 @@ programs. script using the `-m` arg ```bash - python -m cProfile -o -s myscript.py + python -m cProfile -s myscript.py ``` -2. 
Try looking at the output of the profiling. Can you figure out which function took the longest to run? How do you + show the content of the `profile.txt` file? + + ??? success "Solution" + + If you try to open `profile.txt` in a text editor you will see that it is not human readable, because `cProfile` + stores the results in a binary format. To get a better overview of the profiling you can use the `pstats` module to + read the file and print the results in a more readable format. For example, to print the 10 functions with the + highest cumulative runtime you can use the following code: + + ```python + import pstats + p = pstats.Stats('profile.txt') + p.sort_stats('cumulative').print_stats(10) + ``` 3. Can you explain the difference between `tottime` and `cumtime`? Under what circumstances does these differ and when are they equal. + ??? success "Solution" + + `tottime` is the total time spent in the function excluding time spent in subfunctions. `cumtime` is the total + time spent in the function including time spent in subfunctions. Therefore, `cumtime` is always greater than or + equal to `tottime`, and the two are equal when the function does not call any other functions. + 4. To get a better feeling of the profiled result we can try to visualize it. Python does not provide a native solution, but open-source solutions such as [snakeviz](https://jiffyclub.github.io/snakeviz/) exist. Try installing `snakeviz` and load a profiled run into it (HINT: snakeviz expect the run to have the file @@ -47,6 +79,31 @@ programs. 5. Try optimizing the run! (Hint: The data is not stored as torch tensor). After optimizing the code make sure (using `cProfile` and `snakeviz`) that the code actually runs faster. + ??? success "Solution" + + For consistency reasons, even though the data in the `MNIST` dataset class from `torchvision` is stored as + tensors, they are converted to + [PIL images before being returned](https://github.com/pytorch/vision/blob/d3beb52a00e16c71e821e192bcc592d614a490c0/torchvision/datasets/mnist.py#L141-L143).
+ This is why the solution is to initialize the dataset class with the transform + + ```python + mnist_transform = transforms.Compose([transforms.ToTensor()]) + ``` + + such that the data is returned as tensors. However, since the data is already stored as tensors, converting it to + PIL images and back every time a sample is accessed is redundant and can be avoided. The easiest way to do this is + to create a `TensorDataset` from the internal data and labels (which are already tensors). + + ```python + from torchvision.datasets import MNIST + from torch.utils.data import TensorDataset + # ToTensor also rescales pixel values to the [0, 1] range, so we divide by 255 to get the same scale + train_dataset = MNIST(dataset_path, train=True, download=True) + train_dataset = TensorDataset(train_dataset.data.float() / 255.0, train_dataset.targets) + test_dataset = MNIST(dataset_path, train=False, download=True) + test_dataset = TensorDataset(test_dataset.data.float() / 255.0, test_dataset.targets) + ``` + ## PyTorch profiling Profiling machine learning code can become much more complex because we are suddenly beginning to mix different diff --git a/s5_continuous_integration/unittesting.md b/s5_continuous_integration/unittesting.md index 3f2990a67..bbe70d7a2 100644 --- a/s5_continuous_integration/unittesting.md +++ b/s5_continuous_integration/unittesting.md @@ -189,9 +189,9 @@ The following exercises should be applied to your MNIST repository def test_error_on_wrong_shape(): model = MyAwesomeModel() - with pytest.raises(ValueError, match='Expected input to a 4D tensor') + with pytest.raises(ValueError, match='Expected input to a 4D tensor'): model(torch.randn(1,2,3)) - with pytest.raises(ValueError, match='Expected each sample to have shape [1, 28, 28]') + with pytest.raises(ValueError, match=r'Expected each sample to have shape \[1, 28, 28\]'): model(torch.randn(1,1,28,29)) ``` diff --git a/s6_the_cloud/README.md b/s6_the_cloud/README.md index 8796936de..a3f3e64d8 100644 --- a/s6_the_cloud/README.md +++
b/s6_the_cloud/README.md @@ -41,7 +41,7 @@ of the biggest being: - Alibaba Cloud They all have slight advantages and disadvantages over each other. In this course, we are going to focus on Google -Cloud platform, because they have been kind enough to sponsor $50 of cloud credit to each student. If you happen to run +Cloud Platform, because they have been kind enough to sponsor $50 of cloud credit to each student. If you happen to run out of credit, you can also get some free credit for a limited amount of time when you sign up with a new account. What's important to note is that all these different cloud providers all have the same set of services and that learning how to use the services of one cloud provider in many cases translates to also knowing how to use the same services at diff --git a/s7_deployment/apis.md b/s7_deployment/apis.md index 7bbf21e78..99d258710 100644 --- a/s7_deployment/apis.md +++ b/s7_deployment/apis.md @@ -544,7 +544,7 @@ you can look through for help. 2. Next, create a `Dockerfile` with the following content ```Dockerfile - FROM python:3.9 + FROM python:3.11-slim WORKDIR /code COPY ./requirements.txt /code/requirements.txt diff --git a/s7_deployment/exercise_files/simple_fastapi_app.dockerfile b/s7_deployment/exercise_files/simple_fastapi_app.dockerfile index 8967942b6..877c96b80 100644 --- a/s7_deployment/exercise_files/simple_fastapi_app.dockerfile +++ b/s7_deployment/exercise_files/simple_fastapi_app.dockerfile @@ -1,4 +1,4 @@ -FROM python:3.9-slim +FROM python:3.11-slim EXPOSE $PORT diff --git a/slides/IntroToMLOps.pdf b/slides/IntroToMLOps.pdf index 45181fc5c..11f009eb9 100644 Binary files a/slides/IntroToMLOps.pdf and b/slides/IntroToMLOps.pdf differ diff --git a/tools/repo_stats/scraper.py b/tools/repo_stats/scraper.py index 034e3a058..291301f51 100644 --- a/tools/repo_stats/scraper.py +++ b/tools/repo_stats/scraper.py @@ -124,7 +124,7 @@ def main(): contributor.commits_pr += 1 commits += pr_commits - activity_matrix = 
create_activity_matrix(commits) + activity_matrix = create_activity_matrix(commits, max_delta=3) average_commit_length = sum([len(c) for c in commit_messages]) / len(commit_messages)