Inference, Diffusion, World Models, and More | YC Paper Club
ELI5 / TLDR
Five researchers stand up at a Y Combinator event and each explains one recent AI paper. The thread running through all of them is the same: machines have always been bottlenecked by two scarce things — how fast they can think, and how much data they need to learn. Each paper attacks one bottleneck with a clever trick: making a model answer faster, teaching a robot to imagine consequences before acting, or squeezing far more learning out of a tiny pile of data. None of it requires a giant new model — mostly it’s old ideas, dusted off and used in a smarter place.
The Full Story
Why thinking fast is the same as thinking well
The first speaker, a Stanford student, wants to convert the room into “inference enjoyers.” Inference is just the part where a trained model actually produces answers — the weights are baked, you’re now using them. He makes a claim worth holding onto:
Inference today is seen as a sort of cost or convenience lever. But in one, two or three years inference is going to be seen as a capability.
Here’s the bridge. Modern models get smarter the longer they’re allowed to “think” — to churn out more tokens of reasoning before answering. If that’s true, then how fast you can produce those tokens sets a ceiling on how smart the model can be in the time you’ve got. Speed becomes intelligence. So making answers come out faster isn’t a plumbing chore; it’s a way to make the thing smarter.
His trick is called speculative decoding, and the idea is older than AI: your laptop’s processor does the same thing. Imagine a slow, careful editor and a fast, sloppy intern. Producing text one word at a time is slow, so you let the intern (a tiny model) rattle off a guess for the next several words. Then the editor (the big, accurate model) checks all those guessed words in a single glance, keeps them up to the first one it would not have written itself, and throws away the rest. The reason this works is a quirk worth naming: checking a sentence is faster than writing it. A model can grade many words at once but can only write them one at a time.
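The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it assumes greedy decoding (a guess is kept only if it exactly matches what the big model would have picked), and `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names invented for the example.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # context -> greedy next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The small draft model guesses k tokens, one after another.
    guesses: List[Token] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2. The target model checks each guessed position. A real transformer
    #    scores all positions in one batched pass; this loop only emulates it.
    accepted: List[Token] = []
    for guess in guesses:
        wanted = target(context + accepted)
        if guess != wanted:
            accepted.append(wanted)  # fix the first wrong guess, drop the rest
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the same pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_tokens and report how many verification rounds it took."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


# Toy stand-ins (assumptions): the target counts upward modulo 10,
# and the draft agrees except that it stumbles after a 6.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10


tokens, rounds = generate(toy_target, toy_draft, prompt=[0], n_tokens=12)
```

With these toy models the output is identical to decoding with the target alone, but the 12 tokens arrive in 3 verification rounds instead of 12 one-token steps.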
The catch is that the intern and the editor have to take turns: the intern can’t guess the next batch until the editor finishes grading the last one. His paper, “speculative speculative decoding,” breaks that turn-taking. The intern starts guessing the next batch while the editor is still grading the current one, betting on the most likely grades. Get the bet right most of the time (they manage 80-90%) and the waiting nearly disappears. Result: about 300 tokens per second (a token is roughly a word piece) out of a 70-billion-parameter model.
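A back-of-envelope latency model shows why a high hit rate makes the waiting nearly disappear. This is an assumed simplification, not the paper's analysis: it supposes the next draft always finishes within the verification time, and that a wrong bet costs one fresh draft. The function names and the timings in the usage lines are made up for illustration.

```python
def sequential_round_ms(draft_ms: float, verify_ms: float) -> float:
    """Turn-taking: the draft must finish before verification can start."""
    return draft_ms + verify_ms


def overlapped_round_ms(draft_ms: float, verify_ms: float,
                        hit_rate: float) -> float:
    """Drafting overlaps verification; a wrong bet forces a fresh draft.

    Assumes draft_ms <= verify_ms, so a correct bet hides drafting entirely.
    """
    if draft_ms > verify_ms:
        raise ValueError("this simple model assumes draft_ms <= verify_ms")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return verify_ms + (1.0 - hit_rate) * draft_ms


# Illustrative numbers only: 4 ms to draft, 10 ms to verify, 85% of bets right.
turn_taking = sequential_round_ms(4.0, 10.0)
overlapped = overlapped_round_ms(4.0, 10.0, hit_rate=0.85)
```

Under these invented timings a round drops from 14 ms to roughly 10.6 ms, and at a hit rate of 1.0 the drafting cost vanishes from the critical path.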
Teaching robots to imagine before they move
The next two papers are cousins, both about giving machines a mental model of the world. A DeepMind scientist presents diffusion model predictive control. Strip the jargon: before a robot moves, it would help if it could picture what happens next. So you give it two components — one that proposes a sequence of moves, and one that predicts where those moves land it (a “dynamics model,” or world model). A planner tries out proposals in imagination, scores them, and picks the best.
The elegant part is keeping these two pieces separate. Imagine a robot dog that learned to run, and then breaks an ankle. Because the “how to move” part and the “what happens when I move” part are stored separately, you only need to update the second one — re-teach it the new physics of a bad leg — and the dog recovers much of its gait. He also shows you can train it to run, then at the last minute change the goal to “jump,” and it jumps, without retraining. That flexibility is the payoff of separating what you want from how the world works.
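The proposer, dynamics model, and planner described above can be sketched with a toy one-dimensional robot. This is a hedged illustration of the separation, not the paper's method: the proposer here enumerates every short action sequence instead of sampling from a diffusion model, and `plan`, `propose`, `healthy`, `injured`, and `reach` are hypothetical names.

```python
from itertools import product
from typing import Callable, List, Tuple

State = float
Action = float
Plan = Tuple[Action, ...]


def plan(state: State,
         propose: Callable[[], List[Plan]],
         dynamics: Callable[[State, Action], State],
         score: Callable[[State], float]) -> Plan:
    """Imagine every proposed action sequence and return the best-scoring one."""
    def imagined_return(actions: Plan) -> float:
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)   # roll the world model forward; nothing moves
            total += score(s)
        return total
    return max(propose(), key=imagined_return)


def propose() -> List[Plan]:
    # Stand-in for a learned proposer: all 3-step sequences of -1, 0, +1.
    return list(product((-1.0, 0.0, 1.0), repeat=3))


def healthy(s: State, a: Action) -> State:
    return s + a                          # every step covers full distance


def injured(s: State, a: Action) -> State:
    return s + 0.5 * a if a > 0 else s + a   # forward steps are now weaker


def reach(goal: State) -> Callable[[State], float]:
    return lambda s: -abs(s - goal)       # closer to the goal scores higher


normal_plan = plan(0.0, propose, healthy, reach(1.0))
adapted_plan = plan(0.0, propose, injured, reach(1.0))   # swap dynamics only
new_goal_plan = plan(0.0, propose, healthy, reach(-2.0)) # swap the goal only
```

Swapping `healthy` for `injured` changes the chosen plan from one forward step to two, and swapping the goal changes it again, while `propose` and `plan` stay untouched in both cases.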
The billion-dollar question hiding in a small model
Isaac Ward presents a world-model paper from Yann LeCun’s group, and flags the stakes plainly:
Yann LeCun’s raise of $1.03 billion dollars back in March basically just to train world models is sort of what this presentation is about.
The big debate: should an AI agent carry an explicit picture of the world in its head (model-based), or just map situations straight to actions with no mental picture (model-free)? Model-free is simpler and often works, but it’s brittle when it meets something new, and it can’t tell you when it’s confused. A model-based agent can.
Training a world model has a famous failure mode he calls collapse. The model has to learn a compact internal code for messy camera images and learn how actions change that code. The lazy shortcut the math keeps reaching for is to make every situation look identical in its head — predictions become trivially perfect because nothing ever changes. Think of a student who answers “it depends” to every exam question: technically never wrong, completely useless.
Most world models bolt on some special trick to prevent that collapse. This paper’s contribution is a cleaner trick. It insists the model’s internal representations stay scattered in a healthy, bell-curve-shaped spread rather than all squashing to one point: a single tidy mathematical rule (they call it SIGReg) instead of a pile of hand-tuned hacks. The reward: it runs about 50 times faster than rivals, fits on a single graphics card, and has only 15 million parameters, tiny by today’s standards.
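To make the anti-collapse idea concrete, here is a crude stand-in for that kind of regularizer. It is explicitly not the paper's SIGReg formulation: it only checks that random one-dimensional projections of the embeddings have mean near 0 and variance near 1, and the function name and all parameters are assumptions made for this sketch.

```python
import math
import random
from statistics import fmean, pvariance
from typing import List


def gaussian_spread_penalty(embeddings: List[List[float]],
                            n_directions: int = 8, seed: int = 0) -> float:
    """Penalize embeddings whose random 1-D projections stray from mean 0, variance 1."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    penalty = 0.0
    for _ in range(n_directions):
        # Draw a random unit direction to project onto.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        direction = [x / norm for x in direction]
        projected = [sum(e * d for e, d in zip(emb, direction))
                     for emb in embeddings]
        penalty += fmean(projected) ** 2 + (pvariance(projected) - 1.0) ** 2
    return penalty / n_directions


# A collapsed encoder maps every input to the same point.
collapsed = [[0.0] * 4 for _ in range(100)]

# A healthy encoder (simulated here) spreads inputs like a standard Gaussian.
data_rng = random.Random(1)
spread = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]

collapsed_penalty = gaussian_spread_penalty(collapsed)
spread_penalty = gaussian_spread_penalty(spread)
```

The collapsed embeddings would give a perfect prediction loss, yet they score a large penalty here, while the well-spread ones score close to zero. Adding such a term to the training loss is what removes the lazy shortcut.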
The capability he’s most excited about is surprise quantification. Feed the model a trick scenario — secretly teleport an object — and you see a clean spike in its prediction error at the exact moment of the trick. The agent can sense its own confusion. A model-free system can’t do that natively, and for anything operating in the real world, knowing when you don’t know is gold.
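The surprise signal is simply the gap between what the world model predicted and what actually arrived. A minimal sketch, assuming a one-dimensional object position and a hand-written predictor in place of a learned latent model (`prediction_errors`, `first_surprise`, and the data are hypothetical):

```python
from typing import Callable, List, Optional


def prediction_errors(observations: List[float],
                      predict: Callable[[float], float]) -> List[float]:
    """Error between each predicted next observation and the real one."""
    return [abs(predict(prev) - nxt)
            for prev, nxt in zip(observations, observations[1:])]


def first_surprise(errors: List[float], threshold: float) -> Optional[int]:
    """Index of the first transition whose error exceeds the threshold."""
    return next((i for i, e in enumerate(errors) if e > threshold), None)


# The object drifts right one unit per step, then is secretly teleported.
observations = [0.0, 1.0, 2.0, 3.0, 4.0, 20.0, 21.0, 22.0]


def predict(position: float) -> float:
    return position + 1.0   # toy world model: "it keeps drifting right"


errors = prediction_errors(observations, predict)
surprise_at = first_surprise(errors, threshold=1.0)
```

The error is zero everywhere except the single transition where the teleport happens, which is the spike the talk describes.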
Generalization was never a mystery
The fourth talk pushes back on AI’s own folklore. Practitioners often say it’s a “mystery” that bigger models generalize better — they invoke spooky-sounding phenomena (overparameterization, benign overfitting, double descent) as evidence we just don’t understand learning. Ashe, presenting Andrew Gordon Wilson’s paper, argues these aren’t mysteries; they yield to classical theory that people simply weren’t applying correctly.
The intuition that lands: a good model is an expressive hypothesis space with a soft inductive bias. Imagine clay flexible enough to take any shape (so it can fit anything, including pure noise), but with a gentle lean toward simple shapes. Show it random nonsense and it can memorize it; show it real structure and its lean toward simplicity makes it find the genuine pattern instead of the noise. There’s also a neat geometric fact: as models get bigger, the space of “flat” solutions — broad valleys rather than narrow spikes — grows explosively, and flat solutions are both simpler and more compressible, which is exactly what generalization theory predicts should help. No magic, just under-used math.
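One textbook bound of the kind this argument leans on fits in a line. It is offered as an illustration, not as the exact statement in the paper. Assume a countable hypothesis class, a prior $P$ chosen before seeing data, a loss bounded in $[0, 1]$, and $n$ i.i.d. training examples. Then with probability at least $1 - \delta$, every hypothesis $h$ satisfies:

```latex
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2n}}
```

Here $R(h)$ is the true risk and $\hat{R}(h)$ is the training risk. Nothing in the bound counts parameters. A huge model can still get a tight guarantee if the solution it finds is simple enough to receive high prior mass $P(h)$, which is the "soft lean toward simplicity" in formal dress.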
What if compute were free?
The closer is the one the host calls an obsession. Everyone optimizes AI for compute efficiency, but this talk flips the question: the internet’s text grows ~3% a year while pre-training compute grows 4-5x a year. We’re heading into a world that’s starved for data but drowning in compute. So: if data is fixed and compute is unlimited, how much can you learn?
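The size of that gap is easy to work out by compounding the two growth rates. A small sketch using the figures quoted above (the function name and the five-year horizon are illustrative choices):

```python
def compute_per_token_multiplier(years: int,
                                 compute_growth: float = 4.0,
                                 data_growth: float = 1.03) -> float:
    """How much more compute is available per token of text after `years`."""
    return (compute_growth / data_growth) ** years


low_estimate = compute_per_token_multiplier(5, compute_growth=4.0)
high_estimate = compute_per_token_multiplier(5, compute_growth=5.0)
```

If both trends held for five years, the compute available per token of text would grow by somewhere between several hundred and a few thousand times, which is why the talk treats compute as effectively unlimited relative to data.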
Konwoo’s answer revives techniques older than the deep-learning boom. Starve a model down to 200 million tokens and the obvious move (train bigger, reuse the data) backfires — it just memorizes. But crank up regularization absurdly (penalties ~30x stronger than normal) and the loss falls along a clean, predictable curve. Stack ensembling on top — train a fistful of small models and average them, like polling many forecasters instead of trusting one — and you do even better; a committee of small models beats one big model on the same budget. Compose both and you get roughly a 5x data-efficiency win that holds even when extrapolated to trillions of tokens. Then the practical kicker: distill that bulky committee back down into one small model and keep ~83% of the gains. Even weirder, a model distilling into a fresh copy of itself improves — self-teaching that actually works. The moral: in the coming data-scarce regime, the dusty classics — regularization, ensembling, distillation — are quietly the frontier.
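The ensembling and distillation steps can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions, not the paper's setup: each "small model" is a one-parameter line whose slope carries its own random error, and all names are hypothetical.

```python
import random
from statistics import fmean
from typing import Callable, List

Model = Callable[[float], float]


def truth(x: float) -> float:
    return 2.0 * x


def make_member(rng: random.Random) -> Model:
    # Each small model is the true slope plus its own independent error.
    slope = 2.0 + rng.gauss(0.0, 0.5)
    return lambda x: slope * x


def mse(model: Model, xs: List[float]) -> float:
    return fmean((model(x) - truth(x)) ** 2 for x in xs)


def ensemble(members: List[Model]) -> Model:
    # Poll every member and average the answers.
    return lambda x: fmean(m(x) for m in members)


def distill(teacher: Model, xs: List[float]) -> Model:
    # Least-squares fit of a one-parameter student y = w * x to teacher outputs.
    w = sum(teacher(x) * x for x in xs) / sum(x * x for x in xs)
    return lambda x: w * x


rng = random.Random(0)
xs = [0.5, 1.0, 1.5, 2.0]
members = [make_member(rng) for _ in range(8)]
committee = ensemble(members)
student = distill(committee, xs)

average_member_error = fmean(mse(m, xs) for m in members)
committee_error = mse(committee, xs)
student_error = mse(student, xs)
```

Averaging can never do worse than the typical member on squared error, because the members' independent mistakes partly cancel. In this linear toy the student recovers the committee exactly; the roughly 83% retention quoted above is the figure for real language models, where distillation is lossy.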
Key Takeaways
- Speculative decoding trades spare compute for speed: a tiny model guesses several tokens, the big model verifies them in one pass. Verifying is cheaper than generating because transformers can grade many tokens in parallel but only write one at a time.
- SSD (speculative speculative decoding) removes the turn-taking bottleneck by drafting the next round while the current round is still being verified, which hides drafting latency. It hits 300 tokens/sec on Llama 3 70B across 4 H100s.
- Inference speed is reframed as a capability, not a cost: if quality scales with thinking time, tokens-per-second sets the ceiling on intelligence.
- Model Predictive Control = an action proposer + a dynamics (world) model + a planner that imagines outcomes and picks the best action. DMPC (diffusion model predictive control) uses diffusion models for both the proposer and the dynamics model, reducing compounding error and simplifying the planner.
- Separating “how to move” from “what happens when I move” lets a robot adapt to a broken limb (re-tune only the dynamics model) or a new goal (jump instead of run) without full retraining.
- Agents can be model-based (carrying an explicit internal picture, a world model) or model-free (situation → action directly). Model-based agents can quantify their own surprise: a detectable spike in prediction error when something unexpected happens.
- Representational collapse is the central failure of world-model training: the model cheats by making all situations look identical. LeJEPA prevents it with one regularizer (SIGReg) forcing latent embeddings into a healthy Gaussian spread — 15M parameters, single GPU, ~50x faster than rivals.
- Big models generalizing well is not a mystery — classical theory (PAC-Bayes, flat minima, soft inductive biases) explains overparameterization and benign overfitting once applied correctly. Bigger models find more compressible solutions, and flat-minima volume grows exponentially with size.
- A neural net is best understood as an expressive hypothesis space with a soft inductive bias — flexible enough to fit anything, but biased toward simple/compressible solutions.
- Data is the coming bottleneck: internet text grows ~3%/yr, pre-training compute grows ~4-5x/yr. In a data-constrained, compute-rich regime, classical tricks win — aggressive regularization + ensembling = ~5x data efficiency, distillable back into a small model retaining ~83% of the gain.
- Self-distillation (a model teaching a fresh copy of itself) surprisingly improves loss, behaving like an implicit 2-model ensemble.
Claude’s Take
This is a genuinely good sampler — five researchers, each given a few minutes, each forced to compress a real paper into its load-bearing idea. The format keeps everyone honest; there’s no room to hide behind notation. The recurring theme is the actual gift here: across inference, robotics, and pre-training, the frontier keeps turning out to be old ideas (CPU speculation, ensembling, regularization, distillation, Sutton’s 1990 world model) re-pointed at a new bottleneck. That’s a more useful mental model than “scale solves everything.”
Two honest caveats. First, this is a recruiting event as much as a seminar — the opening citation-and-fundraising roll call sets a self-congratulatory tone, and several results are on toy benchmarks (push-T, 200M-token sandboxes). The presenters are mostly careful to say so, which earns trust. Second, the “infinite compute” framing is a deliberate provocation; nobody has infinite compute, and a 5x data-efficiency win is real but not civilization-altering. Still, the asymptote-chasing methodology is a clean idea.
An 8 because the signal-to-noise is high, the explanations are unusually honest about limitations, and at least three of the five ideas (inference-as-capability, surprise quantification, data-constrained scaling) are worth carrying around. Docked from higher because the talks are necessarily shallow — these are appetizers pointing at papers, not the papers themselves.
Further Reading
- Andrew Gordon Wilson, “Deep Learning Is Not So Mysterious or Different” — the cleanest of the five, demystifies generalization with classical theory.
- LeCun et al., JEPA / LeJEPA — joint-embedding predictive architecture; the world-model paper and the $1B bet behind it.
- Richard Sutton, “Integrated Architectures for Learning, Planning, and Reacting” (1990) — the original world-model framing (the Dyna paradigm).
- Hoffmann et al., “Training Compute-Optimal Large Language Models” (the “Chinchilla” scaling laws): the compute-optimal baseline the infinite-compute paper deliberately departs from.
- Lotfi et al. — on compressible solutions and non-vacuous PAC-Bayes generalization bounds, the backbone of the generalization talk.