A learning companion for the course "How Diffusion Models Work" by DeepLearning.AI and instructor Sharon Zhou.
This README is a structured walkthrough of the main concepts covered in the course, designed for students and learners who want to dive deep into diffusion models and their underlying mechanics.
Diffusion models are a family of generative models that have rapidly advanced the field of AI-generated media, powering tools like Stable Diffusion, DALL·E 2, and Imagen.
At their core, diffusion models learn to reverse a gradual noising process:
- Take real data (e.g., an image).
- Add noise step by step until it becomes pure noise.
- Train a neural network to learn the reverse process: removing noise step by step until samples from the original data distribution are recovered.
One of the most well-known implementations is DDPM (Denoising Diffusion Probabilistic Models), which this course explores.
The intuition behind diffusion models comes from forward and reverse diffusion processes:
Forward Process (Adding Noise)
- Start with an image.
- Add a small amount of Gaussian noise repeatedly over many steps.
- Eventually, the image becomes indistinguishable from pure Gaussian noise.
- This process is fixed, not learned.
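Below is a minimal PyTorch sketch of this fixed forward process, assuming a simple linear beta schedule; the schedule values, tensor shapes, and function names are illustrative and not taken from the course code.

```python
import torch

T = 500                                   # number of diffusion timesteps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)     # noise variance added at each step
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product over timesteps

def add_noise(x0, t):
    """Sample a noisy image x_t from q(x_t | x_0) in closed form for a batch."""
    noise = torch.randn_like(x0)
    sqrt_ab = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise, noise

# Example: noise a batch of 8 random "images" at random timesteps.
x0 = torch.randn(8, 3, 64, 64)
t = torch.randint(0, T, (8,))
xt, noise = add_noise(x0, t)
```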
Reverse Process (Removing Noise)
- Train a neural network to predict the noise that was added at each step.
- By subtracting the predicted noise, you get a cleaner image at each stage.
- Repeated many times, noise turns back into a meaningful image.
🔑 Key Idea: If a model can denoise well, it can generate new data from noise by reversing the process.
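As a rough illustration of that idea, the snippet below shows how a noise prediction can be turned into a closed-form estimate of the clean image; it reuses the hypothetical `alpha_bar` schedule from the forward-process sketch, and the function name is made up for this README.

```python
# Assumes the `alpha_bar` schedule defined in the forward-process sketch above.
def estimate_x0(xt, pred_noise, t):
    """Closed-form estimate of the clean image given x_t and the predicted noise."""
    sqrt_ab = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return (xt - sqrt_one_minus_ab * pred_noise) / sqrt_ab
```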
Sampling in diffusion models refers to the reverse generation process: turning noise into data.
DDPM Sampling Steps:
- Start with random Gaussian noise.
- At each step, use the trained neural network to predict the noise component.
- Subtract the noise → get a slightly cleaner image.
- Repeat until a final realistic image emerges.
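The loop below is a minimal sketch of DDPM sampling, assuming the schedule tensors from the forward-process sketch and a trained noise-prediction network called as `model(x, t)`; the variance choice and names are illustrative, not the course implementation.

```python
import torch

@torch.no_grad()
def sample(model, shape, T, betas, alphas, alpha_bar, device="cpu"):
    """Run the learned reverse process, starting from pure Gaussian noise."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                          # predicted noise at step t
        coef = betas[t] / (1.0 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()       # denoised mean of x_{t-1}
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # re-inject some noise
        else:
            x = mean                                     # final step: no extra noise
    return x
```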
Challenge:
- Typically requires hundreds or thousands of steps for high-quality results.
- Slow, but yields very sharp and realistic samples.
📌 Example: Generating a face image → begin with static noise → gradually refine → end up with a clear human face.
The neural network architecture is the engine of diffusion models:
- Usually a U-Net (encoder–decoder with skip connections).
- Input: a noisy image + timestep information.
- Output: predicted noise at that timestep.
- Encoder compresses the image → captures global context.
- Decoder reconstructs details → captures local textures.
- Skip connections preserve spatial details lost in compression.
- Conditioning on text, class labels, or other signals can be added as extra inputs to the network.
- Example: Text-to-image generation uses embeddings from models like CLIP or transformers.
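The toy module below sketches how a noisy image, a timestep embedding, and a skip connection come together in a U-Net-style network; it omits real down/upsampling and attention, and the layer sizes are illustrative, not the course architecture.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, channels=3, hidden=64, emb_dim=64):
        super().__init__()
        # Small MLP that embeds the timestep into the feature dimension.
        self.time_emb = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU(),
                                      nn.Linear(emb_dim, hidden))
        self.enc = nn.Conv2d(channels, hidden, 3, padding=1)      # encoder: global context
        self.mid = nn.Conv2d(hidden, hidden, 3, padding=1)        # bottleneck
        self.dec = nn.Conv2d(hidden * 2, channels, 3, padding=1)  # decoder with skip input

    def forward(self, x, t):
        # Embed the timestep and add it to the encoder feature maps.
        temb = self.time_emb(t.float().view(-1, 1))[:, :, None, None]
        h = torch.relu(self.enc(x)) + temb
        m = torch.relu(self.mid(h))
        # Skip connection: concatenate encoder features with bottleneck output.
        return self.dec(torch.cat([m, h], dim=1))  # predicted noise, same shape as x
```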
Training a diffusion model means teaching it to predict the noise added at each step.
- Take an image from the dataset.
- Pick a random timestep t.
- Add Gaussian noise to the image according to the forward diffusion schedule.
- Feed the noisy image + timestep into the neural network.
- Train the network to output the exact noise that was added.
🔑 Loss Function: Usually a simple Mean Squared Error (MSE) between the predicted noise and the true noise.
If the model can predict noise accurately at any step, it can reverse the process for sampling.
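A minimal training-step sketch of this objective follows, reusing the hypothetical `T`, `add_noise`, and `TinyUNet` from the earlier sketches; the optimizer settings and batch handling are placeholders.

```python
import torch
import torch.nn.functional as F

model = TinyUNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x0):
    t = torch.randint(0, T, (x0.shape[0],))       # random timestep per image
    xt, true_noise = add_noise(x0, t)             # forward-noise the batch
    pred_noise = model(xt, t)                     # network predicts the added noise
    loss = F.mse_loss(pred_noise, true_noise)     # simple MSE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```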
Diffusion models are powerful but computationally expensive. Researchers have developed methods to make them faster and more controllable:
- Fewer Steps: Use techniques like DDIM (Denoising Diffusion Implicit Models) to cut down steps while maintaining quality.
- Noise Schedules: Modify how noise is added/removed for more efficient denoising.
- Classifier Guidance: Steer generation towards a target class by using gradients from a classifier.
- Classifier-Free Guidance: Train the model both with and without prompt conditioning, then blend the conditional and unconditional predictions at inference for stronger control (see the sketch after this list).
- Prompt Engineering: In text-to-image systems, the quality of the description directly influences the final output.
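As a sketch of the classifier-free guidance combination step, the function below assumes a model that accepts an optional context embedding `c` (with `None` meaning unconditional); the function name and guidance scale are illustrative, not a fixed API.

```python
def guided_noise(model, x, t, context, guidance_scale=7.5):
    """Blend unconditional and conditional noise predictions (classifier-free guidance)."""
    eps_uncond = model(x, t, c=None)      # prediction without the prompt
    eps_cond = model(x, t, c=context)     # prediction with the prompt embedding
    # Push the prediction further in the direction the prompt implies.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```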
- Diffusion models work by learning to denoise data step by step.
- DDPM provides the foundation: forward noise process + reverse learned denoising.
- Sampling is slow but high-quality, with improvements like DDIM speeding it up.
- U-Net architectures power most diffusion models, often conditioned on text or labels.
- Control techniques (guidance, schedules) give flexibility and efficiency.
This README is inspired by the course How Diffusion Models Work by DeepLearning.AI and instructor Sharon Zhou.
All credit for the original course content goes to them. This document is a learner’s structured summary for study and review purposes.