Inference Diffusion World Models And More Yc Paper Club
TITLE: Inference, Diffusion, World Models, and More | YC Paper Club
CHANNEL: Y Combinator
DATE: 2026-05-28
---TRANSCRIPT---
All right. Hello everyone. How are you guys doing? Welcome to the first ever YC paper club. This is a very exciting thing. Absolutely thrilled with the response. We had over a thousand folks apply to come in. It was a very hard selection. If you guys have friends that didn’t make the cut, I’m very sorry. We need to keep it to about a hundred, and so we selected a very, very cool group. The mission is to create this kind of community of great founders and great researchers and try to pull them together. Just for you guys to get a sense for how cool the people in this room are: raise your hand if you have at least five citations, 10 citations, 100 citations, a thousand citations. Wow, this is insane. Okay, 10,000 citations. Oh my god. Okay. All right. This is awesome. I would go up to 300,000, but I think it’s like Chris Manning and that’s about it. So, raise your hand if you’ve raised at least a million dollars. Raise your hand if you’ve raised at least $5 million. At least $10 million, at least $50 million. We still got one. We still got two over here.
[Opening remarks about the YC Paper Club, Pioneer / Woodside history, early OpenAI days, and gathering the South Bay AI community. Five papers follow.]
PAPER 1 — Tanishk, Stanford grad student — Speculative Speculative Decoding (SSD): I’m going to be evangelizing inference for people today. The mental model I had for how inference works was you do this beautiful craftsmanship during training, get these intricate weights, then hand it off to generate tokens. In my mind it’s sort of like you have the weights, just multiply the matrices, why do you need a team for it? But there is in fact a lot of subtlety. Inference costs dominate training costs when serving billions of users. Even within training, RL is starting to exceed pre-training compute, and what is RL but a wrapper on inference. The third point isn’t talked about enough: the reason I got interested in making inference fast was not cost or convenience, it was capability. If you have a method where performance scales with the amount of thinking it does, then the speed at which you can do inference — tokens per second — is exactly the peak intelligence you can deliver. I wanted to work towards a future where we have a data center of 20,000 B200s working on the Riemann hypothesis.
[Demo of three algorithms side by side: normal autoregressive decoding, vLLM speculative decoding, and the hand-rolled SSD engine, which is faster.]
Speculative decoding: a small model (tiny llama) and a big model (big llama). Goal: sample fast from the big llama. The draft generates tokens one by one — autoregressive, several forward passes — guesses for what the big model will output. The target (big) model verifies these guesses in one forward pass over all generated tokens. The key asymmetry: it is easier to verify than to generate. The transformer can get probabilities for many tokens in parallel in one pass, but can’t generate them in parallel. We accept the tokens the big model could plausibly have generated. At the rejection point, you can sample an extra “bonus token” for free without more forward passes. Speculation is currency exchange — trade flops for latency. It’s a deep idea in CS, used in CPUs. But you can’t push it arbitrarily: the bottleneck is the sequential dependence between the small and big model — drafting in round t must precede verification, and round t+1 drafting needs the verification outcome as prefix.
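The accept/reject step described above can be sketched in a few lines. This is a minimal illustration of the standard speculative-sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the residual), not the speaker's engine. The function names and the list-of-probabilities representation are assumptions made for the example.

```python
import random

def sample(weights, rng):
    # Draw one token id with probability proportional to its weight.
    return rng.choices(range(len(weights)), weights=weights)[0]

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens kept from one speculation round.

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions (lists over the vocab) the draft sampled from.
    target_probs: k + 1 distributions from the target's single verification pass.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # the target could plausibly have produced this token
            continue
        # Rejected: the token at the rejection point comes from the
        # residual max(0, p - q), with no extra forward pass.
        residual = [max(0.0, pt - qt)
                    for pt, qt in zip(target_probs[i], draft_probs[i])]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one more token from the target.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

Each round therefore yields between 1 and k + 1 tokens for a single forward pass of the big model, which is where the latency win comes from.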
SSD’s goal: parallelize this sequential operation. Drafting and verification happen at the same time, not collocated. The draft sends back tokens, then immediately starts anticipating the most likely verification outcomes and drafts the next round on top of those while verification runs. If right, drafting latency is hidden. The principal difficulty: predicting verification outcomes ahead of time. You make many guesses on the draft. A verification outcome is a plausible number of accepted tokens plus a bonus token. The bonus token comes from a vocabulary of tens to hundreds of thousands, hard to predict, but you can get it right 80-90% of the time using the draft model’s token distributions (the tokens you chose not to sample are plausible bonus-token candidates). Decode them in parallel as different sequences on a shared prefix. Bonus: verification takes a while, so you get more time to draft, increasing expected tokens per round. Paper covers cache misses, compute allocation across prefix lengths (don’t allocate equally), and cache-hit-rate vs drafting-quality trade-offs. Result: numbers go up. SSD beats SGLang (fastest open-source spec-decoding engine they tried). Speculative decoding is normally a latency win but unclear for throughput; here it wins both. 300 tokens/second for llama 3 70B on 4 H100s.
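One way to picture the "anticipate verification outcomes" step is as building a list of (accepted-prefix length, bonus token) guesses from the draft's own distributions. The sketch below is a toy under stated assumptions: `anticipate_outcomes`, `is_cache_hit`, and the per-prefix `budget` list are hypothetical names, and the real system's allocation policy is described in the paper, not here.

```python
def anticipate_outcomes(draft_tokens, draft_probs, budget):
    """Guess likely verification outcomes as (num_accepted, bonus_token) pairs.

    draft_probs has one distribution per drafted position plus one for the
    position after the last drafted token. budget[a] is how many bonus-token
    guesses to spend on prefix length a (it need not be equal across lengths).
    """
    guesses = []
    for a, probs in enumerate(draft_probs):
        # At a rejection point the bonus token differs from the token the
        # draft sampled, so the sampled token is skipped there.
        sampled = draft_tokens[a] if a < len(draft_tokens) else None
        ranked = sorted((t for t in range(len(probs)) if t != sampled),
                        key=lambda t: probs[t], reverse=True)
        guesses += [(a, t) for t in ranked[:budget[a]]]
    return guesses

def is_cache_hit(guesses, num_accepted, bonus_token):
    # A hit means the next round was already drafted on the right prefix.
    return (num_accepted, bonus_token) in guesses
```

In a real engine each guess would be decoded as a separate sequence on the shared prefix while verification runs. On a hit, the next round's draft is already available and its latency is hidden.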
PAPER 2 — Stannis, research scientist at Google DeepMind — Diffusion Model Predictive Control (DMPC): Currently co-leading a new project on world modeling for robotics, building general-purpose policies on top of video and world models. This is early work from about two years ago. Model Predictive Control (MPC), also called receding horizon control, uses a dynamics model (world model) and an action selector (planner) to build agents that solve tasks by maximizing an objective. Advantages: it can adapt to novel reward functions at test time, dynamics models are easier to learn and generalize better than policies, and the action-proposal/dynamics-model factorization allows easy adaptation to novel dynamics. The overall idea is simple: an action proposal proposes a sequence of actions, a dynamics model evolves them to give future states, an objective function we optimize via a planner, then pick and execute actions.
Two problems to make MPC effective: dynamics models must be accurate to avoid compounding errors, and the planner must be powerful enough to select good actions. DMPC uses diffusion models to learn both multi-step action proposals and multi-step dynamics models. Advantages: reduces compounding errors and simplifies the planning algorithm — a simple sampling-based planner already outperforms previous approaches. [Hierarchical view of related work: factorized policy+dynamics, Dyna paradigm, MPC, joint models, model-free. Trade-offs in runtime planning, adapting to novel rewards/dynamics, leveraging non-expert data, runtime speed, single-step vs multi-step.]
Diffusion-based agents: diffusion policy (condition on observation, generate future actions; needs expert demos, behavior cloning), diffuser (jointly model observations and actions in trajectory space; implicit world modeling plus model-based planning), decision diffuser (condition on history, generate future observations, with a separate inverse dynamics model for actions; can learn from video-only data, which matters because robotics data is the bottleneck), and DMPC (action proposal + dynamics model + planner; allows runtime adaptation to novel rewards and dynamics). Algorithm: from offline data, learn a policy (observation → actions) and a dynamics model (actions → future states). At inference, sample action proposals, score them, rank them, and pick the best. Difference vs prior work: a multi-step action proposal (more action-space coverage if trained on diverse data) and a multi-step dynamics model (evolve over long horizons without compounding error). Diffusion is powerful for multimodal data, and stronger modeling lets us simplify the planner.
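The sample, score, rank, pick loop can be sketched on a toy 1D problem. This is an illustration only: `propose_actions` and `rollout` are hand-written stand-ins for the learned diffusion action proposal and dynamics model, and the 1D state, horizon, and sample count are assumptions chosen to keep the example small.

```python
import random

def propose_actions(state, horizon, rng):
    # Stand-in for the learned multi-step action proposal.
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]

def rollout(state, actions):
    # Stand-in for the learned multi-step dynamics model: x' = x + a.
    states = []
    for a in actions:
        state = state + a
        states.append(state)
    return states

def plan(state, reward_fn, rng, horizon=3, num_samples=128):
    # Sample proposals, score each predicted trajectory, keep the best one.
    candidates = [propose_actions(state, horizon, rng) for _ in range(num_samples)]
    return max(candidates,
               key=lambda acts: sum(reward_fn(s) for s in rollout(state, acts)))

def run_mpc(state, reward_fn, steps=20, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        actions = plan(state, reward_fn, rng)
        state = rollout(state, actions[:1])[0]  # execute the first action, then replan
    return state
```

Because the reward function is only consulted at planning time, swapping it (for example from "reach 3" to "reach -2") changes the behavior without retraining either stand-in, which mirrors the runtime-adaptation argument in the talk.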
Results: competitive in fixed-reward single-task setups. More interesting: adapts to novel rewards at runtime (train on locomotion like run forward, then change reward function to get jumping). Adapts to novel dynamics where joint-modeling approaches struggle — e.g. walker with a broken left ankle: keep the action proposal, adapt only the dynamics model on play data from the new environment, recover much of the performance. Ablations confirm each component (diffusion action proposals, multi-step proposals, multi-step dynamics) improves performance.
PAPER 3 — Isaac Ward — LeJEPA / “Lay World Model” (out of Yann LeCun’s group): I started working on world models a couple years ago before they got hot, now they’re having a moment. Hidden in this presentation is a billion-dollar question — Yann LeCun’s raise of $1.03 billion back in March basically just to train world models. World models: learning the dynamics of the world, using a big neural network to predict how a system changes over time based on inputs. Current state S, play an action (movement, command, language command for a robot), predict the outcome scenario. Capabilities: generating imagined outcomes (the weird hallucinatory imagination sequences), model-based control, and surprise quantification. Not a new idea — Richard Sutton, NeurIPS 1990, describes exactly a modern world model: a black box taking situation and action, outputting a prediction of the immediate next situation.
Changing notation from state to observation (real systems have sensor observations, not true state). Example: a quadrotor world model — observation is kinematic state (position, velocity) plus forward-facing camera images, action is a control input (yaw, move left), prediction is the next observation including generated sensor images. Challenges: action sequences can be long, and the minimum in the optimization landscape may not correspond to desired behavior. The big question: model-free vs model-based policies. Model-free: observations → big neural network → optimal action, no explicit representation of the future. There’s growing evidence these networks contain obfuscated, hard-to-interpret world models in their weights. Model-based: train the world model explicitly, use it to predict outcomes of candidate actions. Model-free shows brittleness to out-of-distribution; model-based lets you quantify modeling error (important for real-world deployment) but needs an extra mechanism to propose action candidates.
Push-T toy example (push the blue T-shaped block into the green slot). Challenge of training: co-learning the representation of the world (compactly representing high-dimensional images/LiDAR) and the dynamics (how actions change that representation). Many solutions to the optimization do nothing useful: one local minimum is “every state is the same,” a trivial collapse. Many techniques exist to avoid collapse. Popular world models: PLDM (planning with latent dynamics models), DINO-WM, Dreamer (DeepMind), TD-MPC. They fall into three categories by how they avoid collapse: an explicit heuristic enforcing healthiness in latent space; foundational methods (use an existing autoencoder/diffusion/video model as a basis and add action conditioning); or privileged data not usually available outside training.
JEPA = joint embedding predictive architecture, LeCun’s main work. LeJEPA is a JEPA model: an image encoder encodes the observation into a latent vector; an action-conditioned forecasting module (predictor) is trained to predict the next latent embedding given an action (not the next image, the next latent), which can then be decoded back to an image. Over a batch, all latent embeddings should be Gaussian-distributed in latent space, enforced by the SIGReg regularizer. SIG = Sketching (1D passes over high-dimensional data), Isotropic (looks the same sliced in any direction), Gaussian-distributed. Take all embeddings, take a 1D slice along each direction, and require each resulting curve to be Gaussian; if so, the latent distribution is healthy. This cheaply evaluates how Gaussian/healthy/non-collapsing your world model is. Add the SIGReg term to the normal predict-next-latent loss: an elegant regularization. Capabilities: (1) open-loop prediction quality: imagined rollouts match real ones on push-T and push-cube; (2) model predictive control: encode the initial and goal observations, then search over actions to move from the start latent to the goal latent. LeJEPA wins on small 2D tasks; DINO-WM wins on 3D (big foundational backbone). About 50x faster than the competition (all work happens in latent space, with no extra forward passes or two model copies), runs on a single card under 24GB VRAM, only 15M parameters. (3) Surprise/model-error quantification: perturb the world (change the T block’s color, teleport the T block) and you see a detectable spike in model error, so world-model-enabled agents can quantify their own uncertainty. Model-free approaches don’t natively give you this.
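The "slice in random directions and check each 1D curve" idea can be sketched with a simple moment-matching penalty. This is a simplified stand-in, not SIGReg itself: it only compares each slice's mean and variance with those of a standard Gaussian, and the function names, slice count, and statistic are assumptions made for illustration.

```python
import math
import random

def random_direction(dim, rng):
    # A random unit vector: one "slice" direction through latent space.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sliced_gaussian_penalty(embeddings, num_slices=32, seed=0):
    """Average mismatch between 1D projections and a standard Gaussian.

    Near 0 when every slice has mean 0 and variance 1; about 1 or more
    when the embeddings collapse to a single point (variance 0).
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_slices):
        d = random_direction(dim, rng)
        proj = [sum(e_i * d_i for e_i, d_i in zip(e, d)) for e in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_slices
```

A term of this kind would be added to the predict-next-latent loss, so that the trivial "every state is the same" solution is penalized rather than rewarded.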
PAPER 4 — Ashe (QA Labs / “QABs”) — Andrew Gordon Wilson’s “Deep Learning Is Not So Mysterious or Different”: We know scaling models leads to better generalization but lack a mechanistic understanding of why. If we understand generalization we might optimize for it — large payoff. People point to overparameterization, benign overfitting, and double descent as mysteries we can’t understand. Andrew’s work dispels these using classical generalization theories. First: PAC-Bayes bounds test loss (generalization) with a training loss term plus a compression term. Historically when people overparameterize, the compression term dominates and bounds become loose/vacuous — but this was a misapplication; you can compute the compression term differently. Mystery 1, overparameterization: from the bias-variance trade-off you’d expect overfitting as you scale, but scaling laws show better generalization. PAC-Bayes view: empirical risk (training loss) goes down as you fit data better, AND (Lotfi et al.) larger models find more compressible solutions — a negative correlation between bits to encode the training set and parameter count, so the compression term also drops. Flatness perspective: as parameters increase, the volume of flat minima exponentially increases while sharp minima grow much less; flat minima are more compressible. So overparameterization fits existing theory, giving useful bounds even at billion-parameter scale.
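To make the "training loss plus compression term" shape concrete, here is a small calculation using the classical finite-hypothesis (Occam) bound, a simpler relative of the PAC-Bayes bounds discussed in the talk. It assumes a loss bounded in [0, 1] and a prefix-free code that describes the trained model in `compressed_bits` bits. The function name and all example numbers are illustrative assumptions, not figures from the paper.

```python
import math

def compression_bound(train_error, compressed_bits, num_samples, delta=0.05):
    """Upper bound on test error that holds with probability >= 1 - delta.

    test_error <= train_error + sqrt((bits * ln 2 + ln(1/delta)) / (2 * n))
    """
    complexity = compressed_bits * math.log(2) + math.log(1.0 / delta)
    return train_error + math.sqrt(complexity / (2.0 * num_samples))

# Counting raw 32-bit parameters of a 1B-parameter model: the bound exceeds 1
# and says nothing.
vacuous = compression_bound(0.1, 32e9, 1e6)

# Counting the bits of a (hypothetical) compressed description: the bound
# becomes informative.
useful = compression_bound(0.1, 1e4, 1e6)
```

The point matches the talk: the bound becomes loose when the complexity term is measured by raw parameter count, and tightens when it is measured by how compressible the found solution is.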
Mystery 2, benign overfitting: deep nets can fit totally random noise yet generalize well on structured data. A regularized polynomial model gives intuition — on random data, enough parameters to fit it; on structured data, regularization pushes toward lower-order terms, giving both flexibility and inductive bias. Neural nets are expressive models with a soft inductive bias. Flexible hypothesis space fits the data but overfits without a bias; pure inductive bias avoids overfitting but can’t model reality’s details; the middle ground is an expressive hypothesis space biased toward generalizing (e.g. compressible) solutions. By the no-free-lunch theorem, the only way to improve learning efficiency is through inductive biases. Given the massive sample-efficiency gap between AI and humans, finding the right inductive biases is a good bet.
PAPER 5 — Konwoo (with Suhas, Percy, Tatsu) — Pre-training under Infinite Compute (data-constrained): The two major problems left in AI are intelligence per watt and intelligence per sample. We’re an order or two off humans on per-watt, orders of magnitude off on per-sample. In Chris Re’s lab: under a fixed amount of data and infinite compute, how much generalization can you achieve? Pre-training kept improving capabilities surprisingly — GPT-3 (2020, in-context learning), Anthropic RLHF (2022, alignment), o1 and DeepSeek R1 (2024, reasoning). Because pre-training is expensive, research focused on compute efficiency, which needs scaling both parameters and data (Chinchilla scaling laws). Problem: we’ll soon be constrained by data. Internet text grows ~3%/year; pre-training compute grows ~4-5x/year. So compute spent per data point grows ~4x year-over-year. Core question: how should you pre-train when constrained by data but unconstrained by compute? Not unlike classical statistics or old benchmarks (MNIST, Penn Treebank) where you’re implicitly data-constrained.
Bring scaling laws to the problem. Chase recipes that monotonically decrease IID validation loss, follow clean power laws; the power-law asymptote estimates best possible loss under infinite compute. Canonical setting: only 200M tokens from DCLM, train larger and larger models. Standard recipe (epoch the data + scale the model + early stopping): even spending more compute, overparameterized models overfit faster and loss increases. Fix with aggressive regularization: optimally tune learning rate, weight decay (something like 30x larger than compute-optimal), and epoch count per parameter count — loss follows a clean power law with model-parameter exponent of 1 (predicted by data-constraint theory) and an asymptote (3.43 here). Baselines that overfit don’t even have a measurable asymptote.
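With the exponent fixed at 1, a power law with an asymptote, L(N) = E + A / N, is linear in 1/N, so the asymptote E is just the intercept of a least-squares line. The sketch below shows that calculation. The data points are synthetic, generated from E = 3.43 (the asymptote quoted in the talk) and an arbitrary A = 45; they are not measurements from the paper.

```python
def fit_asymptote(param_counts, losses, exponent=1.0):
    """Least-squares fit of loss = E + A * N**(-exponent); returns (E, A)."""
    xs = [n ** -exponent for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(losses) / len(losses)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Synthetic example: parameter counts in millions, losses from the assumed law.
sizes = [150.0, 300.0, 600.0, 1400.0]
synthetic_losses = [3.43 + 45.0 / n for n in sizes]
asymptote, scale = fit_asymptote(sizes, synthetic_losses)
```

The fitted intercept is the estimate of the best loss reachable with unlimited parameters under that recipe, which is the quantity the talk proposes comparing across recipes.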
Ensembling: ensemble 300M-parameter models with more members (5 members = 1.5B total params), also a clean power law with exponent 1, and a much lower asymptote than the regularized recipe — a true data-efficiency win. At compute-matched comparison, ensembling beats the regularized recipe — better to train an ensemble of small models than one large model when data-constrained. Compose both: regularization lets you make models larger, ensembling adds a new scaling axis (more models). The “joint scaling recipe” (gold line) takes a double limit — fit scaling laws over ensemble members (K), then a second scaling law over the asymptotes as model size (n) grows — a huge loss improvement. Confirm recipes scale via data scaling laws (repeat at four token counts up to 1.7B). Project a new recipe’s loss onto the standard recipe’s data scaling law to measure effective extra tokens: the joint recipe gives ~5x data efficiency, realizable with finite models (a 5-ensemble of 1B models gives ~3.7x). Similar exponents/asymptotes suggest the 5x win is roughly constant even at 10 trillion seed tokens.
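A small sketch of why an ensemble can only help on log loss, assuming members are combined by averaging their predicted probabilities (the talk does not say whether probabilities or logits are averaged). The member distributions below are made up for illustration.

```python
import math

def ensemble_probs(member_probs):
    # Average the members' next-token distributions position by position.
    k = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / k for i in range(vocab)]

def nll(probs, target):
    # Negative log-likelihood of the correct token.
    return -math.log(probs[target])

members = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.5, 0.1, 0.4]]
target = 0
ensemble_loss = nll(ensemble_probs(members), target)
average_member_loss = sum(nll(m, target) for m in members) / len(members)
```

Because the negative log is convex, `ensemble_loss` is never larger than `average_member_loss` (Jensen's inequality). How much lower it gets depends on how much the members disagree.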
Make it practical with distillation: distill an 8-ensemble (~2.4B total params) into a single dense 300M model, retaining ~83% of the loss improvement — data efficiency doesn’t need large inference compute. Self-distillation (distill a 300M model into a fresh 300M model) surprisingly improves loss, even beating the regularized asymptote — connected to ensembling, viewing self-distillation as implicitly training a 2-ensemble. Trends hold on held-out downstream benchmarks. Works beyond pre-training: continued pre-training a 3B model on 4B math tokens (from a 73B corpus) with aggressive epoching + ensembling matches training on the full 73B tokens — ~17x data efficiency. Main point: in this new regime (data-constrained, compute-unconstrained) algorithmic choices matter a lot; revisit classical ideas (regularization, ensembling, distillation) and chase algorithms with lower compute asymptotes.
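Distillation can be sketched as training a student to match the teacher's soft predictions. The toy below fits student logits to a single teacher distribution by gradient descent on KL(teacher || student). The single-distribution setup, learning rate, and step count are assumptions for illustration; in practice the teacher would be the ensemble's prediction at every token position.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(teacher, student):
    # KL(teacher || student), skipping zero-probability teacher entries.
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def distill(teacher_probs, steps=500, lr=0.5):
    logits = [0.0] * len(teacher_probs)
    for _ in range(steps):
        student = softmax(logits)
        # The gradient of KL(teacher || student) with respect to the student
        # logits is (student - teacher).
        logits = [z - lr * (s - t)
                  for z, s, t in zip(logits, student, teacher_probs)]
    return softmax(logits)
```

The student ends up reproducing the teacher's full distribution rather than only its top prediction, which is the extra signal that soft targets provide over hard labels.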
[Closing remarks: thanks, Slack invite, boba.]