Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics
ELI5/TLDR
This is the final lecture of a Stanford course on how AI generates images. The first half is a fast recap of the whole semester: how a machine learns to turn random static into a picture you asked for, step by step. The second half asks “now what else can we do with this trick?” — and the answer is videos, photo editing, and, surprisingly, making text-based chatbots faster. The instructors close by pointing at where the field is heading: cheaper images, robots, and the looming problem of AIs learning from other AIs’ mistakes.
The Full Story
The core trick: noise in, picture out
The whole course rests on one stubborn problem. The images you want a model to produce live inside a “distribution” — think of it as a vast cloud of all plausible pictures — and that cloud has no simple shape you can reach into and grab from. So the trick is sideways. You start from something you can sample easily — pure random static, the kind of grey snow an untuned TV shows — and you teach a model to walk that static, step by step, toward a real picture.
To learn that walk, you first do the opposite on purpose. Take a clean image and gradually smear noise over it until it’s unrecognizable. That’s the “forward process.” Diffusion is just learning to run that movie backwards: given a half-ruined image, guess what noise was added so you can subtract it.
“diffusion allowed us to go from an easy to sample from distribution… to the data distribution by learning how to remove the noise.”
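The forward process and the noise-prediction target can be sketched in a few lines. This is a minimal sketch, assuming a DDPM-style variance-preserving mix; the summary does not give the exact noise schedule used in the course. The names `add_noise`, `remove_noise`, and `alpha_bar`, and the four-number "image", are illustrative only.

```python
import math
import random


def add_noise(clean, alpha_bar, rng):
    """Mix a clean signal with Gaussian noise.

    alpha_bar near 1 keeps the image almost intact; near 0 leaves
    almost pure static. Returns the noisy signal and the noise used,
    which is the target a noise-prediction model is trained to guess.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [keep * c + spread * n for c, n in zip(clean, noise)]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert the mix, given a (here perfect) guess of the noise."""
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - spread * n) / keep for x, n in zip(noisy, predicted_noise)]


rng = random.Random(0)
image = [0.2, 0.8, 0.5, 0.1]  # toy stand-in for pixel values (assumption)
noisy, noise = add_noise(image, alpha_bar=0.5, rng=rng)
restored = remove_noise(noisy, noise, alpha_bar=0.5)
```

Here the true noise is handed back to `remove_noise`, so the image is recovered exactly. A trained model only estimates that noise, which is why real samplers take many small steps.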
The lecture walks through three different lenses on this same idea, each a bit more elegant than the last:
- Predict the noise (lecture 1). The model looks at a noisy image and guesses what mess was dumped on it.
- Follow the compass (lecture 2). Instead of asking “what noise is here,” ask “in which direction does the nearest clean image lie?” That direction has a name, the score: formally, the gradient of the log-density of the noisy data. Think of it like a compass needle at every point in space, always pointing toward where real pictures live.
- Follow the current (lecture 3, flow matching). Picture the static as one puddle of water and the target images as another. The job is to figure out the flow — a current with a direction and speed at every point — that carries the first puddle into the shape of the second. You learn that current, then just let a fresh drop of static ride it to shore.
The instructor is blunt about which one to remember: flow matching, and specifically a tidied-up version called rectified flow that straightens the path so you need fewer steps to get there. In 2026, that’s the default. The other two are scaffolding that helps you understand why it works.
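To see why a straight path needs fewer steps, here is a minimal sketch of the straight-line interpolation behind rectified flow. It assumes the convention that t=0 is noise and t=1 is data (papers differ on this), and it replaces the trained network with a hypothetical `oracle` that returns the exact velocity. All function names are illustrative.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise, data):
    """Training target on a straight path: a constant velocity."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(start, velocity_fn, steps):
    """Ride the current from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = list(start), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
data = [1.0, -2.0, 0.5]  # toy "image" (assumption)

# Stand-in for a trained network: returns the exact straight-line velocity.
oracle = lambda x, t: velocity_target(noise, data)
one_step = euler_sample(noise, oracle, steps=1)
```

With a perfectly straight path and a perfect velocity, a single Euler step lands on the data. Learned velocity fields are only approximately straight, so practical samplers still use several steps, just fewer than on a curved path.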
Compress first, then generate
Two things the recap had quietly assumed: that we have a prompt to steer the picture, and that we even have a sensible way to represent an image. Lecture 4 fixed the second.
Raw pixels are a bad place to work. They’re enormous (millions of numbers) and wasteful — a patch of blue sky is the same value repeated thousands of times. So before generating anything, you squeeze the image down into a small, dense summary called a latent, using an autoencoder: one network crushes the image into a compact code, another reconstructs it. The squeeze forces the network to keep only what matters.
But a raw squeeze produces a lumpy, spiky space — easy to compress, miserable to generate in. The fix (the variational autoencoder, or VAE) adds a gentle pressure that tidies the latent space into something smooth and well-organized, so the generator has an easy surface to walk on. Generation then happens in this small tidy space, and you only decode back to pixels at the very end.
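The “gentle pressure” is usually written as a KL-divergence term added to the reconstruction loss. The form below is the standard VAE objective, not something the summary spells out; the encoder $q_\phi$, decoder $p_\theta$, and weighting $\beta$ are notation assumed here.

```latex
\mathcal{L}(x) =
  \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big]}_{\text{reconstruct the image}}
  \;+\; \beta \,
  \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\big)}_{\text{keep the latent space smooth}}
```

A small $\beta$ favors sharp reconstructions; a larger one favors a tidier latent space that is easier to generate in.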
The machine that does the work
Lectures 5 and 6 covered the actual engine and how to train it. Early models used a U-Net — a shape that zooms out to grasp the whole image, then zooms back in. The catch: two corners of an image far apart can’t easily “talk.” That matters for a prompt like a teddy bear looking at itself in a mirror, where both sides must agree. The fix was the diffusion transformer, which lets every patch attend to every other patch directly. Today’s best models almost all use this.
A nice training detail: not all noise levels are equally hard. The truly clean and the truly destroyed ends are easy. The murky middle — where the model must commit to what the image even is — is where the real decisions get made, so modern training deliberately spends more time there.
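One way to spend more time in the middle is to draw the noise level from a logit-normal distribution: sample a Gaussian and squash it through a sigmoid. A minimal sketch follows; the `mean` and `std` defaults are illustrative assumptions, not values from the lecture.

```python
import math
import random


def sample_noise_level(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by squashing a Gaussian through a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
levels = [sample_noise_level(rng) for _ in range(10_000)]

# Share of draws in the "murky middle"; uniform sampling would give 0.5.
middle_share = sum(0.25 <= t <= 0.75 for t in levels) / len(levels)
```

Shifting `mean` moves the emphasis toward noisier or cleaner levels, and shrinking `std` concentrates it further around the center.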
Training also has stages: an expensive pre-training run on a giant image corpus; optional continued training to specialize (teach a nature model to draw teddy bears); and fine-tuning to lock in one specific subject — the DreamBooth trick, where you show it five photos of a particular person and bind them to a rare made-up word, plus LoRA, which tweaks only a sliver of the weights instead of all of them. Finally, distillation shrinks the number of steps so the thing is cheap enough to actually ship.
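The “sliver of the weights” idea in LoRA can be sketched as a frozen matrix plus a trainable low-rank product. This is a toy illustration in plain Python lists, assuming a rank-1 update on a 4x4 layer; the helper names `matmul` and `lora_weight` are hypothetical, and real implementations work on tensors inside a deep-learning framework.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(frozen, down, up, scale=1.0):
    """Effective weight: frozen matrix plus a scaled low-rank update up @ down."""
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen, delta)]


d_out, d_in, rank = 4, 4, 1
# Pretrained weight, never updated during fine-tuning (identity here for clarity).
frozen = [[float(i == j) for j in range(d_in)] for i in range(d_out)]
down = [[0.1, 0.2, 0.3, 0.4]]           # rank x d_in, trainable
up = [[0.0] for _ in range(d_out)]      # d_out x rank, trainable, zero at start

adapted = lora_weight(frozen, down, up)
trainable = rank * (d_in + d_out)       # 8 numbers instead of 16
```

Starting `up` at zero means fine-tuning begins from exactly the pretrained behavior. The savings look modest at this size but grow quickly: for a 4096x4096 layer at rank 8, the trainable count is about 65 thousand against roughly 16.8 million.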
How do you know it’s any good
Lecture 7’s question. The gold standard is humans comparing pairs of images, scored with an Elo rating, the same system chess uses. The clever bit: beating a strong model should count for more than beating a weak one.
“if you won against a strong model then you want your delta to be high and if you won against a weak model then it’s not like a lot of signal for you because… almost everyone wins against that weak model”
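The “delta” in that quote falls out of the standard Elo update. A minimal sketch, assuming the classic chess constants (a 400-point scale and K=32); image leaderboards may use different constants or a related statistical model.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, a_won, k=32):
    """Return both ratings after one pairwise comparison."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Same model, same win, different opponents.
vs_strong, _ = update(1500, 1800, a_won=True)
vs_weak, _ = update(1500, 1200, a_won=True)
gain_strong, gain_weak = vs_strong - 1500, vs_weak - 1500
```

Beating the 1800-rated model earns far more points than beating the 1200-rated one, because the second win was already expected.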
Humans are expensive, so there are automated proxies too. FID (Fréchet Inception Distance) measures how far the cloud of generated images sits from the cloud of real ones (lower is better). Increasingly you can also hand the job to a multimodal model and ask it to judge: fast feedback, with humans brought in only when the model seems good enough to deserve it.
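For reference, the “distance between clouds” has a closed form. Each set of images is embedded with a pretrained network (conventionally Inception features), a Gaussian is fitted to each set, and the Fréchet distance between the two Gaussians is reported. The subscripts $r$ (real) and $g$ (generated) are notation assumed here.

```latex
\mathrm{FID} =
  \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Big( \Sigma_r + \Sigma_g - 2\,\big(\Sigma_r \Sigma_g\big)^{1/2} \Big)
```

The first term compares the centers of the two clouds and the second compares their shapes, so the score is zero only when both means and covariances match.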
A reality check: the best models still use what we taught
To prove the course wasn’t a fairy tale, the instructor pulled up a live leaderboard. The very top spots are closed labs (OpenAI, Google, xAI) that publish nothing. But the best open models — Flux 2, Qwen Image, and a new top-ranked one called HiDream — are mostly built from exactly these parts: rectified flow, a diffusion transformer, a VAE, a pretrained text encoder.
The interesting wrinkle: the newest top model drops two of those pieces — no VAE, no pretrained text encoder, generating directly in pixels. Doesn’t that break the lesson? Not quite. It makes pixels tractable by using much bigger patches, and it simply shifts all the hard work onto a brutally large model (up to 200 billion parameters, huge for image generation). The lesson holds — the compressed latent space makes learning easier, but it quietly throws away some fidelity. If you scale hard enough, you might not need the crutch. A trend worth watching, not yet a verdict.
Where the trick travels next
Video is just images with a time axis stacked on. Two new worries: keep frames temporally consistent (the teddy bear can’t sprout sunglasses in one frame and lose them in the next), and keep it tractable, since time multiplies your data. The fix mirrors images: compress in space and time into “space-time latents,” using a causal VAE, one that only looks at the current and earlier frames, never the future, so you can stream the encoding without running out of memory. The first frame gets special “anchor” treatment, and long videos are stitched by handing the last frame of one clip in as the first frame of the next.
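The causality constraint can be illustrated with a toy temporal filter that mixes each frame only with earlier ones. This is a sketch of the idea, not the architecture of any particular video VAE: each frame is reduced to a single number, and padding the past with copies of the first frame is one assumed way to give it an anchor role.

```python
def causal_temporal_filter(frames, weights):
    """Mix each frame with earlier frames only.

    weights[0] applies to the current frame, weights[1] to the previous
    one, and so on. Frames before the start are treated as copies of the
    first frame (an assumption made for this sketch).
    """
    out = []
    for t in range(len(frames)):
        mixed = 0.0
        for lag, w in enumerate(weights):
            mixed += w * frames[max(t - lag, 0)]
        out.append(mixed)
    return out


video = [1.0, 2.0, 3.0, 4.0]  # one scalar per frame, for brevity
weights = [0.5, 0.3, 0.2]
encoded = causal_temporal_filter(video, weights)

# Changing a later frame leaves every earlier output untouched.
edited = causal_temporal_filter([1.0, 2.0, 3.0, 99.0], weights)
```

Because no output depends on the future, encoding the first two frames alone gives the same values as encoding the whole clip, which is what makes streaming possible.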
Image editing exposes a flaw in the naive approach. Ask “make this black and white” and a from-scratch generator might also raise the teddy bear’s arm — it regenerated everything. The smarter framing: treat it as editing, not generation. Have a vision-language model output constrained actions (“reduce brightness 50%”) that a tool like Photoshop applies, guaranteeing the rest of the image survives. The hard part is teaching the model what edits are even possible — researchers do it by mining editing logs and inferring the user’s intent from before/after image pairs.
Diffusion for text is the surprising one. Chatbots today write one word at a time, each word waiting on the last — slow for long outputs. Diffusion suggests denoising the whole answer at once, coarse to fine.
“let’s suppose you’re writing a speech… you typically don’t write things in a very sequential way. You first start with a draft… and then what you do is you refine, go from coarse to fine grain.”
Text is discrete, so “noise” can’t be Gaussian static — instead you randomly mask tokens and train the model to fill the blanks at varying noise levels. At inference you start from an all-masked sequence and progressively reveal, re-masking the tokens you’re unsure about. Reported speedups reach 10x, especially good for coding (where you often fill in the middle, not just append). The cost: training is far more expensive and many auto-regressive tricks don’t transfer. Block diffusion hybrids and startups like Inception are chasing it.
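The masking and progressive-reveal loop can be sketched with a toy denoiser. Everything here is illustrative: `toy_predict` is a hypothetical stand-in that already knows the answer and reports a made-up confidence, and the loop simply leaves low-confidence positions masked for a later pass instead of re-masking tokens it has already revealed.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, noise_level, rng):
    """Forward process for text: hide each token with probability noise_level."""
    return [MASK if rng.random() < noise_level else tok for tok in tokens]


def decode(length, predict, steps):
    """Start fully masked; each step commits only the most confident guesses."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        guesses = {i: predict(seq, i) for i in masked}  # i -> (token, confidence)
        # Reveal an even share of what is left; the rest stays masked for now.
        quota = -(-len(masked) // (steps - step))       # ceiling division
        ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:quota]:
            seq[i] = guesses[i][0]
    return seq


# Toy stand-in for a trained denoiser: knows the answer, is surer near the start.
target = "def add ( a , b ) : return a + b".split()
toy_predict = lambda seq, i: (target[i], 1.0 / (1 + i))

rng = random.Random(0)
corrupted = mask_tokens(target, noise_level=0.5, rng=rng)  # a training input
result = decode(len(target), toy_predict, steps=4)
```

The 12-token answer is produced in 4 passes instead of 12, which is where the speedup over one-token-at-a-time decoding comes from. A real model makes mistakes, so quality depends on how many passes are used and how confidence is estimated.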
Closing thoughts
The second instructor offered a metric to watch: roughly 10 cents per megapixel for a top-tier image today, a price to track as it falls toward commodity levels. Near-term frontiers: genuine reasoning with images (precise diagrams, not vague projections), multimodal synthesis (fuse slides + video + audio of a lecture), and eventually robotics and medicine. The looming danger is model collapse: as the internet fills with AI images, the next generation trains on its own echoes, an “echo chamber of mistakes that keep growing.” Countermeasures: provenance standards (C2PA), invisible watermarks (Google’s SynthID), and law catching up on safety.
Key Takeaways
- Image generation reframes an impossible sampling problem: start from easy-to-sample random noise, learn to walk it step by step into the hard-to-sample distribution of real images.
- Three equivalent framings — predict the noise (diffusion), follow the score (a compass toward clean images), and follow the flow (a current carrying noise to data). Flow matching, specifically rectified flow, is the 2026 default because straighter paths need fewer sampling steps.
- Generation happens in a compressed latent space, not raw pixels. A VAE crushes images small and tidies the space so it’s easy to generate in; because pixels are highly redundant (sky, correlated neighbors), the compression loses little that matters.
- The diffusion transformer beat the older U-Net because it lets every image patch attend to every other patch directly — essential for globally coherent images.
- Mid-range noise levels are the hardest (that’s where the model decides what the image is), so modern training samples them more often via a logit-normal distribution instead of uniform.
- Training stages: pre-training (expensive, broad), continued training (specialize), fine-tuning (DreamBooth binds a subject to a rare token; LoRA tunes only a few weights), then distillation to cut inference steps for deployment.
- Evaluation uses Elo ratings from human pairwise comparisons (beating a strong model counts more), plus automated proxies like FID (distance between generated and real image distributions) and multimodal-model-as-judge.
- Top open models (Flux 2, Qwen Image, HiDream) are largely built from the course’s ingredients. The newest drops the VAE and pretrained text encoder, generating in pixel space with huge patches and up to 200B parameters, which suggests (but does not yet prove) that the latent-space crutch can be scaled away.
- Video = images plus time. Compress along both space and time into space-time latents using a causal VAE (each frame depends only on itself and earlier frames, enabling streaming); the first frame is a special anchor.
- Image editing works better framed as constrained edits (VLM emits tool actions) than as from-scratch regeneration, which fails to preserve the original.
- Diffusion can speed up text generation by a reported factor of up to ~10x by denoising a whole masked answer at once instead of one token at a time. That suits coding and fill-in-the-middle, but training is costlier and many auto-regressive tricks don’t transfer.
- Watch-list: ~10 cents per megapixel as the commodity benchmark; model collapse from AI training on AI output; provenance (C2PA) and watermarking (SynthID) as defenses.
Claude’s Take
This is a genuinely good capstone lecture — a clean, honest map of a field that usually drowns newcomers in notation. The instructor’s restraint is the strength: he keeps saying “I’ll spare you the derivations” and instead hands you the intuition and a single load-bearing takeaway (learn flow matching, the rest is scaffolding). The teddy-bear-as-running-example device is a small thing that does a lot of work.
What earns trust is the reality check against the live leaderboard. Rather than claiming the course teaches you everything, he shows that the newest top model breaks one of the lessons (drops the VAE), then explains why the lesson still holds — and explicitly refuses to over-generalize from one two-week-old paper. That’s the right epistemic posture, and it’s rare in ML pedagogy.
The weaker stretch is the very end. The “closing thoughts” from the second instructor drift toward the speculative — robotics, medicine, “maybe one day you won’t need to come to lecture.” It’s the obligatory inspirational sign-off, lighter on substance than the rest. The 10-cents-per-megapixel metric is a nice concrete anchor amid the hand-waving.
Score: 8. It’s a recap lecture, so it’s broad rather than deep, and you’d want the earlier seven lectures to truly grasp any one piece. But as a standalone tour of how modern image generation actually works — and where it’s bleeding into video, editing, and text — it’s lucid, honest, and well-paced. Docked from a 9 only because the forward-looking second half thins out.
Further Reading
- Rectified Flow — the straightened-path variant of flow matching the lecture calls the modern default. Worth chasing if you want the one idea that underpins today’s models.
- Diffusion Transformers (DiT), Peebles & Xie, 2022 — the architecture that replaced the U-Net and now dominates.
- DreamBooth — the fine-tuning method for binding a model to a specific subject from a handful of photos.
- LoRA (Low-Rank Adaptation) — the technique for tuning only a small slice of a large model’s weights.
- Block Diffusion — the hybrid that blends auto-regressive and diffusion text generation to handle variable output length.
- DeepSeek-OCR (referenced as “deepser”) — treats text as images, a promising angle on token efficiency.
- Stanford CS231N — the recommended companion computer-vision course.