
Lecture 12
glouppe committed May 8, 2024
1 parent 0dd57eb commit 272ad5d
Showing 10 changed files with 418 additions and 88 deletions.
8 changes: 7 additions & 1 deletion closing.md
@@ -30,6 +30,12 @@ The models covered in this course have broad applications in artificial intelligence

class: middle

The field of deep learning is evolving rapidly. What you have learned in this course is just the beginning!

---

class: middle

## Exam

- 1 question on the fundamentals of deep learning (lectures 1 to 4)
@@ -73,7 +79,7 @@ class: black-slide, middle
- Deep Learning is more than feedforward networks.
- It is a .bold[methodology]:
- assemble networks of parameterized functional blocks
- train them from examples using some form of gradient-based optimisation.
- train them from data using some form of gradient-based optimisation.
- Bricks are simple, but their nested composition can be arbitrarily complicated.
- Think like an architect: make cathedrals!
]
109 changes: 48 additions & 61 deletions code/lec11-vae.ipynb


3 changes: 3 additions & 0 deletions course-syllabus.md
@@ -12,6 +12,9 @@ Prof. Gilles Louppe<br>

R: paper https://t.co/wVg6xUmt7d

Q&A: ask me anything (during the course, on any topic; questions collected on a platform)
Give examples more closely related to engineering and science. Focus on people rather than technology.

---

# Us
Binary file added figures/lec11/embedding0.png
Binary file added figures/lec12/ald.gif
255 changes: 255 additions & 0 deletions figures/lec12/assimilation.svg
Binary file added figures/lec12/sda-qg.png
12 changes: 11 additions & 1 deletion lecture11.md
@@ -148,6 +148,15 @@ count: false

class: middle

.center.width-90[![](figures/lec11/embedding0.png)]

.footnote[Credits: Francois Fleuret, [Deep Learning](https://fleuret.org/dlc/), UNIGE/EPFL.]

---

class: middle
count: false

.center.width-90[![](figures/lec11/embedding1.png)]

.footnote[Credits: Francois Fleuret, [Deep Learning](https://fleuret.org/dlc/), UNIGE/EPFL.]
@@ -558,7 +567,8 @@ Unbiased gradients of the ELBO with respect to the generative model parameters $
$$\begin{aligned}
\nabla\_\theta \text{ELBO}(\mathbf{x};\theta,\phi) &= \nabla\_\theta \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p\_\theta(\mathbf{x},\mathbf{z}) - \log q\_\phi(\mathbf{z}|\mathbf{x})\right] \\\\
&= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta ( \log p\_\theta(\mathbf{x},\mathbf{z}) - \log q\_\phi(\mathbf{z}|\mathbf{x}) ) \right] \\\\
&= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta \log p\_\theta(\mathbf{x},\mathbf{z}) \right],
&= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta \log p\_\theta(\mathbf{x},\mathbf{z}) \right] \\\\
&= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta \log p\_\theta(\mathbf{x} | \mathbf{z}) \right],
\end{aligned}$$
which can be estimated with Monte Carlo integration.
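
As an illustration, here is a minimal PyTorch-style sketch of this one-sample Monte Carlo estimator; the `encoder` and `decoder` modules and the `decoder.log_prob` interface are hypothetical placeholders, not the course notebooks.

```python
import torch

def grad_theta_elbo(x, encoder, decoder):
    """One-sample Monte Carlo estimate of the ELBO gradient w.r.t. theta.

    `encoder(x)` returns the mean and log-variance of q_phi(z|x), and
    `decoder.log_prob(x, z)` returns log p_theta(x|z); both are hypothetical.
    """
    with torch.no_grad():                  # no gradient through the sampling of z
        mu, log_var = encoder(x)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # z ~ q_phi(z|x)
    loss = -decoder.log_prob(x, z).mean()  # Monte Carlo estimate of -E_q[log p_theta(x|z)]
    loss.backward()                        # populates the gradients of theta
    return loss
```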

119 changes: 94 additions & 25 deletions lecture12.md
@@ -8,17 +8,6 @@ Lecture 12: Diffusion models
Prof. Gilles Louppe<br>
[g.louppe@uliege.be](mailto:g.louppe@uliege.be)

???

Good references:
- https://arxiv.org/pdf/2208.11970.pdf
- https://cvpr2022-tutorial-diffusion-models.github.io/
- Understanding Deep Learning book
- Continuous : infinite noise levels https://www.youtube.com/watch?v=wMmqCMwuM2Q (build some intuition first)

- Rewrite to better match the sidenotes
- Give more intuition about the score function and about the annealing schedule

---

# Today
@@ -113,6 +102,16 @@ class: middle

class: middle

## Data assimilation in ocean models

.center.width-65[![](./figures/lec12/sda-qg.png)]

.footnote[Credits: [Rozet and Louppe](https://arxiv.org/pdf/2306.10574.pdf), 2023.]

---

class: middle

# VAEs

A short recap.
@@ -141,27 +140,23 @@ $$\begin{aligned}
&= \arg \max\_{\theta,\phi} \mathbb{E}\_{p(\mathbf{x})} \left[ \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p\_\theta(\mathbf{x}|\mathbf{z})\right] - \text{KL}(q\_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z})) \right].
\end{aligned}$$

.alert[Issue: The prior matching term limits the expressivity of the model.]
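
For concreteness, a minimal sketch of the resulting (negative) ELBO loss, assuming a Bernoulli decoder, a diagonal Gaussian encoder, and an $\mathcal{N}(\mathbf{0}, \mathbf{I})$ prior; the tensors `x_hat_logits`, `mu`, and `log_var` would come from hypothetical encoder/decoder networks, not the course's reference implementation.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_hat_logits, mu, log_var):
    """Negative ELBO for a Bernoulli decoder and a N(0, I) prior (sketch only)."""
    rec = F.binary_cross_entropy_with_logits(
        x_hat_logits, x, reduction="sum")                              # -E_q[log p_theta(x|z)]
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())     # KL(q_phi(z|x) || N(0, I))
    return rec + kl    # minimizing this maximizes the ELBO
```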

---

class: middle
class: middle, black-slide, center
count: false

The prior matching term limits the expressivity of the model.
Solution: Make $p(\mathbf{z})$ a learnable distribution.

Solution: Make $p(\mathbf{z})$ a learnable distribution.
.width-80[![](figures/lec12/deeper.jpg)]

???

Explain the maths on the blackboard, taking the expectation w.r.t. $p(\mathbf{x})$ of the ELBO and considering the expected KL terms.

---

class: middle, black-slide, center
count: false

.width-80[![](figures/lec12/deeper.jpg)]

---

class: middle

## (Markovian) Hierarchical VAEs
@@ -262,6 +257,12 @@ class: middle

.center.width-100[![](figures/lec12/diffusion-kernel-1.png)]

.center[

Diffusion kernel $q(\mathbf{x}\_t | \mathbf{x}\_{0})$ for different noise levels $t$.

]

.footnote[Credits: [Simon J.D. Prince](https://udlbook.github.io/udlbook/), 2023.]
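
A minimal sketch of sampling from this kernel, assuming the variance-preserving parameterization $\mathbf{x}\_t = \sqrt{\bar\alpha\_t}\mathbf{x}\_0 + \sqrt{1-\bar\alpha\_t}\boldsymbol\epsilon$ used in DDPMs; `alphas_bar` is a hypothetical precomputed noise schedule, not a name from the lecture code.

```python
import torch

def sample_diffusion_kernel(x0, t, alphas_bar):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I).

    `alphas_bar` is a hypothetical 1-D tensor of cumulative products of (1 - beta_t);
    `t` is a batch of integer time steps.
    """
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over the batch
    eps = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps
```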

---
@@ -270,6 +271,12 @@ class: middle

.center.width-100[![](figures/lec12/diffusion-kernel-2.png)]

.center[

Marginal distribution $q(\mathbf{x}\_t)$.

]

.footnote[Credits: [Simon J.D. Prince](https://udlbook.github.io/udlbook/), 2023.]

---
@@ -416,6 +423,8 @@ $$\begin{aligned}

class: middle

In summary, training and sampling thus boil down to:

.center.width-100[![](figures/lec12/algorithms.png)]
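
A compact sketch of both algorithms under the standard DDPM parameterization; `eps_model`, `betas`, and `alphas_bar` are hypothetical names for the noise-prediction network and its noise schedule.

```python
import torch

def ddpm_training_loss(x0, eps_model, alphas_bar):
    """Training (simplified): regress the noise injected into x_0."""
    t = torch.randint(0, len(alphas_bar), (x0.shape[0],), device=x0.device)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps       # x_t ~ q(x_t | x_0)
    return ((eps - eps_model(xt, t)) ** 2).mean()

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas, alphas_bar):
    """Sampling (simplified): ancestral sampling from x_T down to x_0."""
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        eps = eps_model(x, torch.full((shape[0],), t))
        mean = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps) / (1.0 - betas[t]).sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                    # sigma_t^2 = beta_t
    return x
```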

???
@@ -428,7 +437,7 @@ class: middle

## Network architectures

Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent $\hat{\mathbf{x}}\_\theta(\mathbf{x}\_t, t)$ or $\epsilon\_\theta(\mathbf{x}\_t, t)$.
Diffusion models often use U-Net architectures (at least for image data) with ResNet blocks and self-attention layers to represent $\hat{\mathbf{x}}\_\theta(\mathbf{x}\_t, t)$ or $\epsilon\_\theta(\mathbf{x}\_t, t)$.

<br>

@@ -446,11 +455,71 @@ class: middle

class: middle

The .bold[score function] $\nabla\_{\mathbf{x}\_0} \log q(\mathbf{x}\_0)$ is a vector field that points in the direction of the highest density of the data distribution $q(\mathbf{x}\_0)$.
## Score-based models

Maximum likelihood estimation for energy-based probabilistic models $$p\_{\theta}(\mathbf{x}) = \frac{1}{Z\_{\theta}} \exp(-f\_{\theta}(\mathbf{x}))$$ is often intractable because the partition function $Z\_{\theta}$ cannot be computed efficiently.
We can sidestep this issue with a score-based model $$s\_\theta(\mathbf{x}) \approx \nabla\_{\mathbf{x}} \log p(\mathbf{x})$$ that approximates the (Stein) .bold[score function] of the data distribution. If we parameterize the score-based model with an energy-based model, then we have $$s\_\theta(\mathbf{x}) = \nabla\_{\mathbf{x}} \log p\_{\theta}(\mathbf{x}) = -\nabla\_{\mathbf{x}} f\_{\theta}(\mathbf{x}) - \nabla\_{\mathbf{x}} \log Z\_{\theta} = -\nabla\_{\mathbf{x}} f\_{\theta}(\mathbf{x}),$$
which discards the intractable partition function and expands the family of models that can be used.
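
In code, this amounts to a single automatic-differentiation call; a minimal sketch with a hypothetical energy network `f_theta`.

```python
import torch

def score_from_energy(f_theta, x):
    """Score of an energy-based model: s_theta(x) = -grad_x f_theta(x).

    `f_theta` is a hypothetical network mapping a batch of inputs to scalar
    energies; note that the partition function Z_theta never appears.
    """
    x = x.detach().requires_grad_(True)
    energy = f_theta(x).sum()                    # sum over the batch before autograd
    grad_x, = torch.autograd.grad(energy, x)     # grad_x f_theta(x), one per input
    return -grad_x
```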

---

class: middle

The score function points in the direction of the highest density of the data distribution.
It can be used to find modes of the data distribution or to generate samples by .bold[Langevin dynamics], which iterates the sampling rule
$$\mathbf{x}\_{i+1} = \mathbf{x}\_i + \epsilon \nabla\_{\mathbf{x}\_i} \log p(\mathbf{x}\_i) + \sqrt{2\epsilon} \mathbf{z}\_i,$$
where $\epsilon$ is the step size and $\mathbf{z}\_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. When $\epsilon$ is sufficiently small and the number of steps is sufficiently large, Langevin dynamics produces samples from $p(\mathbf{x})$.

.center.width-30[![](figures/lec12/langevin.gif)]

.footnote[Credits: [Song](https://yang-song.net/blog/2021/score/), 2021.]
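
A minimal sketch of this sampler; `score_fn` stands for any approximation of $\nabla\_\mathbf{x} \log p(\mathbf{x})$, e.g. a trained $s\_\theta$.

```python
import torch

def langevin_sampling(score_fn, x_init, eps=1e-4, n_steps=1000):
    """Unadjusted Langevin dynamics with a fixed step size (sketch only)."""
    x = x_init.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = (x + eps * score_fn(x) + (2.0 * eps) ** 0.5 * z).detach()  # one Langevin step
    return x
```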

---

class: middle

Similarly to likelihood-based models, score-based models can be trained by minimizing the .bold[Fisher divergence] between the data distribution $p(\mathbf{x})$ and the model distribution $p\_\theta(\mathbf{x})$ as
$$\mathbb{E}\_{p(\mathbf{x})} \left[ || \nabla\_{\mathbf{x}} \log p(\mathbf{x}) - s\_\theta(\mathbf{x}) ||\_2^2 \right].$$

---

class: middle

Unfortunately, the explicit score matching objective leads to inaccurate estimates in low-density regions, where few data points are available to constrain the score.

Since initial sample points are likely to be in low-density regions in high-dimensional spaces, the inaccurate score-based model will derail the Langevin dynamics and lead to poor sample quality.

.center.width-100[![](figures/lec12/pitfalls.jpg)]

.footnote[Credits: [Song](https://yang-song.net/blog/2021/score/), 2021.]

---

class: middle

To address this issue, .bold[denoising score matching] can be used to train the score-based model to predict the score of increasingly noisified data points.

For each noise level $t$, the score-based model $s\_\theta(\mathbf{x}\_t, t)$ is trained to predict the score of the noisified data point $\mathbf{x}\_t$ as
$$s\_\theta(\mathbf{x}\_t, t) \approx \nabla\_{\mathbf{x}\_t} \log p\_{t} (\mathbf{x}\_t)$$
where $p\_{t} (\mathbf{x}\_t)$ is the noise-perturbed data distribution
$$p\_{t} (\mathbf{x}\_t) = \int p(\mathbf{x}\_0) \mathcal{N}(\mathbf{x}\_t ; \mathbf{x}\_0, \sigma^2\_t \mathbf{I}) d\mathbf{x}\_0$$
and $\sigma^2\_t$ is an increasing sequence of noise levels.
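
Because the perturbation is Gaussian, the conditional score $\nabla\_{\mathbf{x}\_t} \log \mathcal{N}(\mathbf{x}\_t; \mathbf{x}\_0, \sigma^2\_t \mathbf{I}) = -(\mathbf{x}\_t - \mathbf{x}\_0)/\sigma^2\_t$ has a closed form and serves as the regression target. A minimal sketch of generating such training pairs:

```python
import torch

def dsm_pair(x0, sigma_t):
    """Sample x_t ~ N(x_0, sigma_t^2 I) together with its regression target.

    The conditional score grad_{x_t} log N(x_t; x_0, sigma_t^2 I) equals
    -(x_t - x_0) / sigma_t^2, which is what s_theta(x_t, t) is trained to match.
    """
    xt = x0 + sigma_t * torch.randn_like(x0)
    target = -(xt - x0) / sigma_t ** 2
    return xt, target
```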

---

class: middle

The training objective for $s\_\theta(\mathbf{x}\_t, t)$ is then a weighted sum of Fisher divergences for all noise levels $t$,
$$\sum\_{t=1}^T \lambda(t) \mathbb{E}\_{p\_{t}(\mathbf{x}\_t)} \left[ || \nabla\_{\mathbf{x}\_t} \log p\_{t}(\mathbf{x}\_t) - s\_\theta(\mathbf{x}\_t, t) ||\_2^2 \right]$$
where $\lambda(t)$ is a positive weighting function, often chosen as $\lambda(t) = \sigma\_t^2$, which increases with $t$ and balances the contributions of the different noise levels.
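
A minimal sketch of the corresponding training loss, assuming $\lambda(t) = \sigma\_t^2$ and a hypothetical noise-conditional network `score_model(x_t, t)`.

```python
import torch

def dsm_loss(score_model, x0, sigmas):
    """Denoising score matching loss summed over noise levels (sketch only).

    `sigmas` is an increasing sequence of noise levels; lambda(t) = sigma_t^2.
    """
    loss = 0.0
    for t, sigma in enumerate(sigmas):
        xt = x0 + sigma * torch.randn_like(x0)
        target = -(xt - x0) / sigma ** 2                       # conditional score
        t_idx = torch.full((x0.shape[0],), t, device=x0.device)
        residual = score_model(xt, t_idx) - target
        loss = loss + sigma ** 2 * residual.pow(2).flatten(1).sum(dim=1).mean()
    return loss
```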

---

class: middle

It can be used to find modes of the data distribution or to generate samples by Langevin dynamics.
Finally, annealed Langevin dynamics can be used to sample from the score-based model by running Langevin dynamics with decreasing noise levels $t=T, ..., 1$.

.center.width-40[![](figures/lec12/langevin.gif)]
.center.width-100[![](figures/lec12/ald.gif)]

.footnote[Credits: [Song](https://yang-song.net/blog/2021/score/), 2021.]
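
A minimal sketch of annealed Langevin dynamics, reusing the hypothetical `score_model` above; the per-level step-size rescaling with $\sigma\_t^2$ is one common choice, not the only one.

```python
import torch

def annealed_langevin_sampling(score_model, x_init, sigmas, eps0=2e-5, n_steps=100):
    """Annealed Langevin dynamics (sketch): Langevin steps at each noise level,
    from the largest sigma down to the smallest, reusing the last iterate."""
    x = x_init.clone()
    for t in reversed(range(len(sigmas))):                   # t = T, ..., 1
        eps = eps0 * (sigmas[t] / sigmas[0]) ** 2            # larger steps at higher noise
        t_idx = torch.full((x.shape[0],), t, device=x.device)
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = (x + eps * score_model(x, t_idx) + (2.0 * eps) ** 0.5 * z).detach()
    return x
```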

Binary file modified pdf/lec12.pdf
