The sculptor analogy
A sculptor does not "create" a statue out of nothing. They start with a rough block and chip away what is not the statue, one careful strike at a time. Every strike is small; the result emerges from many strikes in a row.
Diffusion models sculpt images out of noise. They start with a canvas of pure random static and remove a little bit of "wrongness" at every step. After 20–50 steps, the static has become a coherent image of whatever the prompt asked for.
The two-process trick
Training a diffusion model has two halves:
- Forward process (noising) — take a real image and gradually add Gaussian noise over T steps. After enough steps, the image becomes pure static. This process is fixed; nothing is learned here.
- Reverse process (denoising) — train a neural network to predict the noise that was added at each step. Given a noisy image, the model says "here is what I think the noise looks like; subtract it."
At sampling time, start from pure static and run the reverse process T times. Each step removes a little noise. Out the other end: a clean image.
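The two halves can be sketched in a few lines of NumPy. This is a minimal illustration, not a production training loop: the linear beta schedule, its endpoints, and the function names (`make_schedule`, `add_noise`, `training_loss`, `estimate_clean`) are assumptions chosen for the example, and `predict_noise` stands in for the neural network.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule: per-step variances (betas) and the
    cumulative signal fraction alpha_bar[t] that survives after t steps."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def add_noise(x0, t, alpha_bar, rng):
    """Forward process in closed form: jump straight to step t
    instead of adding noise t times in a loop."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def training_loss(predict_noise, x0, t, alpha_bar, rng):
    """Reverse-process objective: mean squared error between the noise
    that was actually added and the network's guess."""
    x_t, eps = add_noise(x0, t, alpha_bar, rng)
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))

def estimate_clean(x_t, eps_pred, t, alpha_bar):
    """'Subtract the noise': invert the closed form to estimate x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
```

If the predicted noise equals the true noise, `estimate_clean` recovers the original image exactly. A real network is only approximately right, which is why sampling takes many small steps rather than one big subtraction.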
Why it produces such good images
- Iterative refinement — every step is a small course-correction over the whole image, so an early mistake can still be fixed later. An autoregressive generator cannot go back and revise what it has already emitted, so its errors compound.
- Probabilistic — the model is sampling from a distribution, not picking a single greedy answer. Diversity comes for free.
- Conditioning is easy — text, depth maps, edges, sketches all attach as extra input to the denoiser. Hence Stable Diffusion, ControlNet, image-to-image, etc.
Latent diffusion — the actual production trick
Running diffusion in raw pixel space is expensive: a 1024×1024 RGB image is over 3 million values per step. Latent diffusion (the technique behind Stable Diffusion) works around this in three stages:
- A small VAE compresses the image into a much smaller latent (e.g. 64×64×4 for a 512×512 image in Stable Diffusion).
- Diffusion happens in latent space, much cheaper.
- The VAE decodes the final latent back into a full-resolution image.
This is the difference between "research demo" and "runs on a consumer GPU."
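The saving is easy to check with arithmetic. The sketch below assumes the Stable Diffusion style VAE layout, which downsamples each spatial side by 8 and uses 4 latent channels (so a 512×512 image gives the 64×64×4 latent mentioned above, and a 1024×1024 image gives 128×128×4). The helper name `num_values` is made up for this example.

```python
def num_values(height, width, channels):
    """Number of scalars the denoiser has to process per step."""
    return height * width * channels

# Raw pixel space: 1024x1024 RGB.
pixel_values = num_values(1024, 1024, 3)             # 3,145,728

# Latent space, assuming 8x downsampling per side and 4 channels.
latent_values = num_values(1024 // 8, 1024 // 8, 4)  # 65,536

ratio = pixel_values / latent_values                 # 48.0
print(f"pixel: {pixel_values:,}  latent: {latent_values:,}  ratio: {ratio:.0f}x")
```

Under these assumptions the denoiser touches 48 times fewer values at every step, before counting any savings from the cheaper attention layers that a smaller spatial grid allows.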
Sampling steps and schedulers
Naive diffusion uses ~1000 noising steps. At sample time, smarter schedulers (DDIM, DPM++, Euler) compress the reverse process to 20–50 steps with little quality loss. Newer consistency models, and distilled variants of diffusion and flow-matching models, push this to 1–4 steps. The trade-off:
- More steps → higher fidelity, slower.
- Fewer steps → faster, occasional artifacts.
Most production stacks default around 20–30 steps.
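To make "compressing the reverse process" concrete, here is a sketch of a deterministic DDIM-style sampler (the eta = 0 case) that visits only a strided subset of the training timesteps. The linear schedule and the names `strided_timesteps`, `ddim_step`, and `sample` are assumptions for illustration; `predict_noise` is a placeholder for a trained denoiser, and real schedulers differ in spacing and update rule.

```python
import numpy as np

def strided_timesteps(T, num_steps):
    """Pick num_steps roughly evenly spaced timesteps from T-1 down to 0."""
    return np.linspace(T - 1, 0, num_steps).round().astype(int)

def ddim_step(x_t, eps_pred, t, t_prev, alpha_bar):
    """One deterministic DDIM update (eta = 0) from timestep t to t_prev."""
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    # t_prev = -1 means "fully denoised": no noise is mixed back in.
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Re-noise the estimate to the (lower) noise level of t_prev.
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def sample(predict_noise, shape, alpha_bar, num_steps, rng):
    """Start from pure static and denoise in num_steps jumps."""
    x = rng.standard_normal(shape)
    ts = strided_timesteps(len(alpha_bar), num_steps)
    for i, t in enumerate(ts):
        t_prev = ts[i + 1] if i + 1 < len(ts) else -1
        x = ddim_step(x, predict_noise(x, t), t, t_prev, alpha_bar)
    return x

# Assumed linear schedule with T = 1000 training steps.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
```

Calling `sample(..., num_steps=20, ...)` runs the denoiser 20 times instead of 1000. The steps knob is simply `num_steps` here: each jump covers a larger change in noise level, which is where the occasional artifacts at very low step counts come from.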
Where diffusion goes beyond images
- Video — temporal diffusion across frames (Sora, Veo, Kling).
- Audio / music — diffusion in spectrogram or latent audio space.
- 3D shapes — diffusion over point clouds or NeRF parameters.
- Molecular design — diffusion over molecular graphs for drug discovery.
The pattern "destroy with noise, learn to undo" generalises remarkably well.
Practical knobs
- Sampling steps — quality vs latency lever.
- Guidance scale (CFG) — how strongly to follow the prompt. Too high = oversaturated, distorted. Too low = ignores the prompt. 5–9 is typical for text-to-image.
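- Seed — same seed + same prompt + same settings (model, scheduler, steps, resolution) = same image, at least on the same hardware and software stack. Makes results reproducible and lets you do controlled comparisons.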
- Negative prompt — "what not to include." Surprisingly powerful, especially for fixing common artifacts (extra fingers, watermarks, etc.).
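Two of these knobs reduce to very little code. The sketch below shows the usual classifier-free guidance combination and seeded starting noise. The function names are made up for this example, and `eps_uncond` and `eps_cond` stand for the denoiser's noise predictions without and with the prompt. In many Stable Diffusion style pipelines the negative prompt is implemented by using its embedding in place of the empty prompt on the unconditional branch, so guidance pushes away from it; treat that as a description of common practice rather than of every implementation.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    extrapolate toward (and, for scale > 1, past) the conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def initial_noise(seed, shape):
    """Seeded starting static: the same seed gives the same starting canvas."""
    return np.random.default_rng(seed).standard_normal(shape)

# Toy predictions standing in for two denoiser outputs.
eps_uncond = np.zeros((2, 2))
eps_cond = np.ones((2, 2))

weak = guided_noise(eps_uncond, eps_cond, 1.0)    # equals eps_cond
strong = guided_noise(eps_uncond, eps_cond, 7.5)  # 7.5x the prompt direction
```

A scale of 1 just returns the conditioned prediction, and a scale of 0 ignores the prompt entirely. Larger scales amplify the difference between the two branches, which is consistent with very high values producing oversaturated, distorted images.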