Project 5 — A diffusion model from scratch
The image and video generators that took the world by storm (Stable Diffusion, DALL·E, Midjourney, Sora) are built on diffusion models. Their idea sounds like magic but is beautifully simple: learn to reverse noise. Add noise to real data step by step until it's pure static; train a network to undo one step of that; then generate by starting from static and denoising into something real. This project builds a complete diffusion model on 2-D points so it trains in 15 seconds and you can measure that it works.
Full code: code/projects/diffusion.py (PyTorch, CPU-friendly).
The idea: destroy, then learn to rebuild
Two processes:
- Forward (fixed, no learning) — gradually add Gaussian noise to a real sample over $T$ steps until it's indistinguishable from random noise. This is just a recipe.
- Reverse (learned) — train a network to predict the noise that was added, so you can subtract it and step back toward real data.
Generate by running the reverse process from pure noise:
forward: data → +noise → +noise → … → pure noise (fixed)
reverse: noise → denoise → denoise → … → new data (learned)
The key trick: instead of learning to generate data directly (hard), the model learns the much easier task of predicting the noise in a noisy sample. Subtracting predicted noise is one denoising step. Do it repeatedly and noise becomes data.
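The forward half of that picture can be sketched in a few lines. This is a plain-Python illustration rather than the project's PyTorch code; it assumes the standard per-step rule $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$ and borrows the linear schedule this project defines in the next section. The helper names `linear_betas` and `noise_trajectory` are hypothetical.

```python
import math
import random

def linear_betas(T=50, lo=1e-4, hi=0.1):
    """Linear noise schedule, mirroring torch.linspace(lo, hi, T)."""
    return [lo + (hi - lo) * i / (T - 1) for i in range(T)]

def noise_trajectory(x0, betas, rng):
    """Apply the forward process one step at a time to a single 2-D point.

    Each step shrinks the point slightly and adds fresh Gaussian noise:
        x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
    Returns the list [x_0, x_1, ..., x_T].
    """
    x = tuple(x0)
    traj = [x]
    for beta in betas:
        keep, add = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = tuple(keep * xi + add * rng.gauss(0.0, 1.0) for xi in x)
        traj.append(x)
    return traj

rng = random.Random(0)
traj = noise_trajectory((1.0, 0.0), linear_betas(), rng)
for t in (0, 10, 25, 50):
    print(t, tuple(round(v, 3) for v in traj[t]))
```

Printing a few steps shows the point drifting away from (1, 0) as noise accumulates; nothing here is learned, which is why the forward process is "just a recipe".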
The target & the model
Our "real data" is points on the unit circle — simple enough to verify numerically (generated points should land at radius ≈ 1), rich enough to be a real distribution.
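As a rough illustration of what "points on the unit circle" means as a data source, here is a dependency-free sketch. The project's own `sample_data` is not shown in this section, so `sample_circle` and `radius_stats` below are hypothetical stand-ins, and uniform sampling over the angle is an assumption.

```python
import math
import random

def sample_circle(n, rng):
    """Draw n points uniformly on the unit circle as (x, y) tuples."""
    pts = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        pts.append((math.cos(theta), math.sin(theta)))
    return pts

def radius_stats(points):
    """Mean and population standard deviation of the distance from the origin."""
    radii = [math.hypot(x, y) for x, y in points]
    mean = sum(radii) / len(radii)
    var = sum((r - mean) ** 2 for r in radii) / len(radii)
    return mean, math.sqrt(var)

pts = sample_circle(500, random.Random(0))
mean_r, std_r = radius_stats(pts)
print(f"mean radius = {mean_r:.3f}, std radius = {std_r:.3f}")
```

Real data scores a mean radius of 1 with essentially zero spread, which is the yardstick the generated points are measured against later.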
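import torch
import torch.nn as nn

T = 50 # number of noising steps
betas = torch.linspace(1e-4, 0.1, T) # how much noise per step (the schedule)
abar = torch.cumprod(1 - betas, dim=0) # cumulative signal kept, ᾱ_t (the noise share is 1 - ᾱ_t)
class Denoiser(nn.Module): # predicts the noise in x_t, given t
    def __init__(self, hidden=64): # layer sizes are an assumption; see diffusion.py for the real ones
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 2)) # input = (x, y, t/T)
    def forward(self, x, t):
        return self.net(torch.cat([x, (t/T).unsqueeze(1)], dim=1))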
The model takes a noisy point x and the timestep t (how noisy it is) and outputs its guess of the noise. It's a tiny MLP (Chapter 11); the architecture can be anything, and the training objective is what makes it a diffusion model.
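To get a feel for the schedule, the sketch below recomputes ᾱ_t in plain Python and prints how a noisy sample is mixed at a few timesteps. It assumes the same values as `torch.linspace(1e-4, 0.1, T)` and only illustrates the arithmetic; it is not part of the project's code.

```python
import math

T = 50
# Same values as torch.linspace(1e-4, 0.1, T)
betas = [1e-4 + (0.1 - 1e-4) * i / (T - 1) for i in range(T)]

abar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta        # cumulative product of (1 - beta_t)
    abar.append(running)

for t in (0, 12, 24, 36, 49):
    signal = math.sqrt(abar[t])         # weight on the clean point x_0
    noise = math.sqrt(1.0 - abar[t])    # weight on the Gaussian noise
    print(f"t={t:2d}  signal={signal:.3f}  noise={noise:.3f}")
```

The signal weight falls steadily while the noise weight rises toward 1. With this short schedule the signal weight at the last step is small but not zero (roughly 0.27), so the final noisy sample is close to, rather than exactly, pure Gaussian noise.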
Training: predict the noise
The whole training step, using the forward-process shortcut $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$:
x0 = sample_data(bs) # a real point
t = torch.randint(0, T, (bs,)) # a random noise level
noise = torch.randn_like(x0) # the noise we add
at = abar[t].unsqueeze(1)
x_t = at.sqrt()*x0 + (1-at).sqrt()*noise # the noisy version
loss = ((model(x_t, t) - noise) ** 2).mean() # predict that noise (MSE)
That's it — mean squared error between the model's predicted noise and the actual noise (Chapter 7). Simple regression, repeated over random samples and noise levels.
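Why is it legitimate to jump straight to step t with the shortcut instead of adding noise t times? A small numeric check, assuming the standard single-step rule $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$, tracks the weight on $x_0$ and the accumulated noise variance step by step and compares them with the closed form. The function names are hypothetical.

```python
import math

T = 50
betas = [1e-4 + (0.1 - 1e-4) * i / (T - 1) for i in range(T)]

def stepwise_coefficients(betas):
    """Track x_t = c_t * x_0 + (Gaussian noise with variance v_t) through single steps.

    One step is x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps, so
        c_t = sqrt(1 - beta_t) * c_{t-1}
        v_t = (1 - beta_t) * v_{t-1} + beta_t
    """
    c, v, out = 1.0, 0.0, []
    for beta in betas:
        c = math.sqrt(1.0 - beta) * c
        v = (1.0 - beta) * v + beta
        out.append((c, v))
    return out

def closed_form_coefficients(betas):
    """The shortcut: c_t = sqrt(abar_t) and v_t = 1 - abar_t."""
    abar, out = 1.0, []
    for beta in betas:
        abar *= 1.0 - beta
        out.append((math.sqrt(abar), 1.0 - abar))
    return out

max_gap = max(
    max(abs(c1 - c2), abs(v1 - v2))
    for (c1, v1), (c2, v2) in zip(stepwise_coefficients(betas),
                                  closed_form_coefficients(betas))
)
print(f"largest difference between the two: {max_gap:.2e}")
```

The two agree up to floating-point rounding, which is what lets training sample a random t and build x_t in one line.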
Generating: denoise from pure noise
Start from random noise and walk the reverse process step by step, subtracting predicted noise each time:
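x = torch.randn(n, 2) # pure noise
for t in reversed(range(T)):
    a, ab = 1 - betas[t], abar[t] # α_t and cumulative ᾱ_t
    with torch.no_grad():
        eps = model(x, torch.full((n,), t)) # predict the noise (t as a batch tensor)
    x = (x - (1-a)/(1-ab).sqrt() * eps) / a.sqrt() # remove a step of it
    if t > 0: # add a little randomness back, except on the last step
        x = x + betas[t].sqrt() * torch.randn_like(x) # assumed scale σ_t = sqrt(β_t), the usual DDPM choice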
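One way to convince yourself the update rule is right is to run it with a perfect noise predictor. The sketch below is plain Python and deliberately artificial: it assumes a "dataset" consisting of the single point `X0`, for which the exact noise in any x_t is known in closed form, and it uses that oracle in place of a trained network. The noise scale σ_t = sqrt(β_t) is likewise an assumption (the common DDPM choice), and `oracle_eps` and `sample` are hypothetical names.

```python
import math
import random

T = 50
betas = [1e-4 + (0.1 - 1e-4) * i / (T - 1) for i in range(T)]
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)

X0 = (0.6, -0.8)  # the single "training point" (it lies on the unit circle)

def oracle_eps(x, t):
    """Exact noise in x at step t when every data point equals X0.

    Stand-in for a trained model: solves x = sqrt(abar_t)*X0 + sqrt(1-abar_t)*eps for eps.
    """
    return tuple((xi - math.sqrt(abar[t]) * x0i) / math.sqrt(1.0 - abar[t])
                 for xi, x0i in zip(x, X0))

def sample(rng):
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))      # start from pure noise
    for t in reversed(range(T)):
        a, ab = 1.0 - betas[t], abar[t]
        eps = oracle_eps(x, t)
        # mean of the reverse step: (x - (1 - a) / sqrt(1 - ab) * eps) / sqrt(a)
        x = tuple((xi - (1.0 - a) / math.sqrt(1.0 - ab) * ei) / math.sqrt(a)
                  for xi, ei in zip(x, eps))
        if t > 0:                                       # no noise on the final step
            sigma = math.sqrt(betas[t])                 # assumed sigma_t = sqrt(beta_t)
            x = tuple(xi + sigma * rng.gauss(0.0, 1.0) for xi in x)
    return x

x_final = sample(random.Random(0))
print(x_final)
```

With an exact noise predictor the loop lands back on `X0` regardless of the starting noise. A trained network is only approximately right, which is one reason generated points end up near the circle rather than exactly on it.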
Running it
$ python diffusion.py
Output:
step 0 loss 1.1303
step 500 loss 0.3930
step 1000 loss 0.3358
step 1500 loss 0.3416
generated 500 points
mean radius = 1.003 (target 1.000)
std radius = 0.114 (small = points really are on the circle)
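We started generation from pure Gaussian noise and the model denoised it into points with mean radius 1.003, so it learned to place points on the unit circle. The small std (0.114) shows the points cluster tightly around radius 1 instead of scattering randomly, although they sit near the circle rather than exactly on it. Radius alone does not show that the points cover every angle, so plot them if you want to check that as well. The model created new data that closely matches the target distribution, starting from noise. That is generative modeling, and it is the same train-then-denoise procedure that diffusion image generators use.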
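To judge those two numbers it helps to know what the starting noise would score on the same test. The radius of a 2-D standard Gaussian follows a Rayleigh distribution, so its mean and standard deviation are known exactly; the sketch below computes them and adds a quick empirical check in plain Python. It is a baseline for comparison only, not output from the project's script.

```python
import math
import random

# Analytic radius statistics for a 2-D standard Gaussian (a Rayleigh distribution).
baseline_mean = math.sqrt(math.pi / 2.0)         # about 1.253
baseline_std = math.sqrt(2.0 - math.pi / 2.0)    # about 0.655

# Empirical check on 500 raw noise points, the same count the script generates.
rng = random.Random(0)
radii = [math.hypot(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
emp_mean = sum(radii) / len(radii)
emp_std = math.sqrt(sum((r - emp_mean) ** 2 for r in radii) / len(radii))

print(f"analytic : mean radius = {baseline_mean:.3f}, std = {baseline_std:.3f}")
print(f"empirical: mean radius = {emp_mean:.3f}, std = {emp_std:.3f}")
```

Undenoised noise sits at a mean radius of about 1.25 with a spread of about 0.66, so a reported 1.003 and 0.114 is a clear move toward the circle and not something the starting noise would produce on its own.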
From circles to Stable Diffusion
The leap to generating photorealistic images is — once again — scale and a couple of additions, not a different idea:
- Bigger denoiser — a U-Net (a CNN, Chapter 13) instead of our MLP, to handle images.
- Latent diffusion — run the process in a compressed latent space (via an autoencoder) instead of raw pixels, for efficiency. This is the "Latent" in Latent Diffusion (Stable Diffusion).
- Text conditioning — feed a text embedding (Chapter 14) into the denoiser so the prompt steers generation. "A cat on the moon" → the model denoises toward images matching that text.
The training objective — predict the noise — is identical to what you just ran.
Don't be confused: diffusion vs. GANs vs. autoregressive. Three ways to generate. Diffusion denoises (today's leader for images). GANs pit a generator against a discriminator (older image method). Autoregressive models generate one token at a time — that's your GPT (the leader for text). Different mechanisms, same goal: model and sample a data distribution.
Make it production
- It's expensive to train: use a pretrained model (Stable Diffusion via Hugging Face diffusers) and fine-tune with LoRA (Chapter 32) for a custom style.
- Generation is iterative (many denoising steps) → slow; techniques like distillation and fewer-step samplers speed it up for serving.
The takeaway — and the end of the book
A diffusion model learns to reverse noise: train it to predict the noise in a sample (simple MSE), then generate by denoising from pure noise. You built one that generates the unit circle from static — the exact process behind Stable Diffusion, which only adds a U-Net, a latent space, and text conditioning.
And that's the book. You started not knowing what a vector was; you've now built a transformer, fine-tuned it with LoRA, given it tools, trained a vision model, and generated data with diffusion — the complete modern AI stack, from first principles, on your own machine. You can define and code every term an ML engineer uses, reason from fundamentals when you meet a new one, and ship what you build. Go build something real. 🚀