YouTube

The Inference Shift | Stratechery by Ben Thompson

Stratechery · published 2026-05-18 · added 2026-05-19 · score 8/10
ai semiconductors nvidia cerebras inference infrastructure agents compute

ELI5/TLDR

For the last few years, AI compute has meant one thing: Nvidia GPUs, expensive and fast, useful for both training models and answering questions. Ben Thompson argues that’s about to fracture into three different markets — training (still Nvidia’s), fast answers for humans (where Cerebras and Groq shine), and agents that work without humans watching (where speed barely matters and cheap, plentiful memory wins). The third market will be the biggest, because it grows with how much compute you have, not how many humans are waiting.

The Full Story

Why GPUs ended up running AI

Pixels and AI math have the same shape: lots of small calculations that can happen at the same time. Nvidia spent two decades making graphics chips programmable through a software layer called CUDA, so when neural networks needed parallel arithmetic, the hardware was already there. The hard part wasn’t the math; it was feeding the chip. Modern models are too big to fit on one piece of silicon, so Nvidia layered on fast memory (HBM, stacks of DRAM packaged right next to the GPU die) and fast links between chips. That networking is what lets tens of thousands of GPUs behave like one machine, which is what training a trillion-parameter model actually requires.

The Anthropic–SpaceX deal in the essay is a tell: 220,000 Nvidia GPUs at Colossus 1, originally bought for xAI’s training, now repurposed for Anthropic’s inference. Same chips, different job. That fungibility is the GPU’s superpower.

Inference is actually three jobs in a trench coat

Thompson breaks inference into three parts. Prefill is reading your prompt: compute-heavy, parallel, GPU-friendly. Decode is generating each new token, and it splits into two interleaved steps: reading back the conversation so far (the KV cache, which grows as the chat grows) and pushing the result through the model’s weights (whose size is fixed by the model). Both decode steps are bound by memory bandwidth, so the chip mostly waits on RAM.
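As a rough illustration of why the KV cache grows with the conversation while the weights do not, here is a minimal Python sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 8-billion-parameter count are hypothetical examples, not figures from the essay. The formula is the usual one for a transformer that caches one key vector and one value vector per layer per token.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def weight_bytes(n_params, bytes_per_param=2):
    """Bytes needed to hold the model weights (independent of context length)."""
    return n_params * bytes_per_param


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    weights_gib = weight_bytes(8_000_000_000) / 2**30
    for seq_len in (1_000, 10_000, 100_000):
        cache_gib = kv_cache_bytes(seq_len, **shape) / 2**30
        print(f"{seq_len:>7} tokens: KV cache {cache_gib:6.2f} GiB, "
              f"weights {weights_gib:6.2f} GiB")
```

With this assumed shape, every additional token of context adds 128 KiB of cache per sequence, so the cache scales linearly with the chat while the weight footprint stays constant.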

Think of it like a chef. Prefill is chopping all the vegetables at once. Decode is plating dishes one at a time, where every dish requires walking back to the pantry twice. GPUs are good at all three because they have raw speed, big HBM stacks, and fast wires between chips. But that’s overkill if the bottleneck is just the walk to the pantry.
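To put the walk to the pantry in numbers: when decode is bandwidth-bound, a single sequence cannot produce tokens faster than memory bandwidth divided by the bytes that must be read per token. The sketch below is a back-of-the-envelope bound only. It assumes batch size 1, ignores compute time and batching (which amortizes weight reads across sequences), and uses hypothetical sizes: 16 GB of weights and 1 GB of KV cache, with 3.35 TB/s as the assumed bandwidth of an H100-class GPU.

```python
def decode_tokens_per_second(bandwidth_bytes_per_s, weight_bytes, kv_bytes):
    """Upper bound on single-sequence decode speed when memory-bound.

    Each generated token requires streaming the weights and the KV cache
    from memory once, so bandwidth / bytes-per-token caps the token rate.
    """
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


if __name__ == "__main__":
    bandwidth = 3.35e12   # bytes/s, assumed H100-class HBM bandwidth
    weights = 16e9        # bytes, hypothetical 8B-parameter model at 16 bits
    kv_cache = 1e9        # bytes, hypothetical cache for a long-ish chat
    rate = decode_tokens_per_second(bandwidth, weights, kv_cache)
    print(f"upper bound: {rate:.0f} tokens/s per sequence")
```

Raising the chip's arithmetic speed does not change this bound; only more bandwidth or fewer bytes per token does, which is the sense in which the chip waits on RAM.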

What Cerebras actually is

Cerebras took the same silicon wafer everyone else cuts into hundreds of chips and decided not to cut. The whole 300mm disk becomes one chip. The result is 44 GB of on-chip memory at 21 PB/s: roughly half the memory of an 80 GB Nvidia H100, but about six thousand times the bandwidth. If your model and its context fit on the wafer, tokens stream out absurdly fast. The moment they don’t fit, the magic disappears. And because the manufacturing yields on a wafer-sized chip are brutal, it’s expensive.

The WSE3 has just over half the memory of an H100, but 6,000 times the memory bandwidth.
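The two ratios in that comparison are easy to check. The sketch below assumes the vendors' publicly listed specs: 44 GB of SRAM at 21 PB/s for the WSE-3, and 80 GB of HBM3 at 3.35 TB/s for the H100 SXM. These figures are assumptions taken from spec sheets, not numbers computed in the essay.

```python
# Assumed spec-sheet figures, in bytes and bytes per second.
WSE3_MEMORY = 44e9        # 44 GB on-chip SRAM
WSE3_BANDWIDTH = 21e15    # 21 PB/s
H100_MEMORY = 80e9        # 80 GB HBM3 (SXM variant)
H100_BANDWIDTH = 3.35e12  # 3.35 TB/s (SXM variant)

capacity_ratio = WSE3_MEMORY / H100_MEMORY
bandwidth_ratio = WSE3_BANDWIDTH / H100_BANDWIDTH

if __name__ == "__main__":
    print(f"memory capacity: {capacity_ratio:.2f}x an H100")
    print(f"memory bandwidth: {bandwidth_ratio:,.0f}x an H100")
```

Under these assumptions the capacity ratio comes out a little over one half and the bandwidth ratio a little over 6,000, matching the quoted comparison.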

Today this matters for coding assistants — reasoning models burn through tokens, and a developer staring at a screen feels the speed. Tomorrow it matters for voice and wearables, where latency is the product. But Thompson thinks even the coding use case is temporary, because eventually the human leaves the loop.

Answer inference vs agentic inference

This is the essay’s load-bearing distinction. Answer inference is what we do today — a person asks, the model responds, the person reads. Speed matters because humans are impatient. Agentic inference is the future Thompson sees — a computer kicks off a task, other computers do the work, results land hours later. Nobody is staring at a token counter.

Once you remove the human, two things flip. First, latency stops mattering. An agent doesn’t care if a step takes 200ms or 20 seconds. Second, the bottleneck moves from token speed to memory capacity: not the small, fast, expensive HBM that GPUs hoard, but the giant, slower, cheaper memory needed to hold context, state, history, embeddings, and logs. Plain DRAM. SSDs. Databases.

The most important aspect for answer inference is token speed. The most important aspect for agentic inference, however, is memory.

Once that flip happens, the case for paying a premium for cutting-edge GPUs weakens. You need “good enough” compute wrapped in a sophisticated memory hierarchy. CPU speed for tool use starts mattering more than GPU speed. Nvidia knows — they’ve launched a framework called Dynamo to split inference workloads across different hardware, and they’re shipping standalone memory and CPU racks. The defensive move is visible.

Why the agentic market becomes the biggest one

Answer inference is gated by humans. There are only so many of us, and we only ask so many questions per day. Agentic inference scales with compute itself — agents kicking off other agents kicking off other agents. The ceiling is electricity and silicon, not patience.
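A toy model makes the difference in scaling concrete. It is an illustrative sketch, not Thompson's own model, and every number in it is hypothetical: answer demand is a product of human-bounded terms, while agentic demand is modeled as a tree in which each agent spawns a fixed number of sub-agents.

```python
def answer_tokens_per_day(users, queries_per_user, tokens_per_query):
    """Human-gated demand: bounded by how many people ask how many questions."""
    return users * queries_per_user * tokens_per_query


def agent_tasks(fan_out, depth):
    """Total tasks when every agent spawns `fan_out` sub-agents, `depth` levels deep.

    Level 0 is the single root task, so the total is a geometric series.
    """
    return sum(fan_out ** level for level in range(depth + 1))


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the shape of each curve.
    print("answer tokens/day:", answer_tokens_per_day(1_000_000, 20, 500))
    for depth in (1, 2, 4, 8):
        print(f"fan-out 5, depth {depth}: {agent_tasks(5, depth):,} tasks")
```

The first quantity grows only when more people ask more questions; the second grows geometrically with depth until something physical, such as power or available chips, caps it.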

Two side observations fall out of this. China, which is locked out of cutting-edge chips but has plenty of decent ones, has more or less everything it needs for agentic inference. The compute embargo bites for training, less so for the largest future market. And space data centers become more plausible — older, simpler chips run cooler, survive radiation better, and don’t need to be repaired, which suits orbit fine.

Jensen Huang likes to say Moore’s Law is dead and that future speed-ups will come from systems design. Thompson’s twist: maybe the most important shift is that Moore’s Law just stops mattering. The compute we already have is good enough. The game is now arranging it cleverly.

Key Takeaways

  • Inference is three operations, not one: prefill (compute-heavy, parallel), KV cache lookup (memory-bandwidth bound, grows with context length), and feed-forward over weights (memory-bandwidth bound, fixed by model size).
  • GPUs dominate because they’re the only chip that can do training AND all three inference steps reasonably well — flexibility is the moat, not raw speed.
  • Cerebras’s wafer-scale chip trades memory capacity for memory bandwidth: roughly 6,000x an H100’s bandwidth, but only about half the memory. Brilliant when the workload fits, useless when it doesn’t.
  • Wafer-scale manufacturing has yield problems baked in — one defect on a normal die loses one chip, one defect on a wafer-chip loses the whole thing. That’s why Cerebras is expensive.
  • Cerebras’s “fast coding” pitch is a stepping-stone market. The real future use case is voice/wearables, where token-generation speed becomes a UX feature humans physically feel.
  • The Anthropic–SpaceX Colossus 1 deal (220k GPUs, 300MW) is a data point for that flexibility: training clusters and inference clusters are interchangeable on Nvidia hardware. That’s not true elsewhere.
  • Agentic inference removes the human-in-the-loop latency constraint. Once latency stops mattering, the entire chip selection logic inverts.
  • When memory becomes the bottleneck, slower/cheaper DRAM and SSDs beat expensive HBM. The system spends most of its time waiting on memory anyway.
  • CPU speed (for tool calls, orchestration, I/O) will matter more than GPU speed for agentic workloads.
  • The agentic inference market scales with compute, not with human population — which is why Thompson thinks it will be the largest market by far.
  • China’s chip-embargo problem is asymmetric: painful for training and answer inference, much less painful for agentic inference, where their domestic chips are already “good enough.”
  • Space data centers become more viable in an agentic world: older nodes are larger (more radiation-resistant), cooler, more reliable, and don’t need to be repaired.
  • Nvidia’s Dynamo framework + standalone memory/CPU racks are the company hedging — disaggregating inference so their expensive GPUs stay busy doing the part they’re best at.
  • Reasoning models (o1-style, Claude with thinking) burn many more tokens per query than predecessors — this is what’s actually driving the inference compute boom, not just user growth.
  • Thompson’s broader thesis: Moore’s Law isn’t just dead, it’s becoming irrelevant. The next decade’s compute story is about arranging existing compute cleverly rather than making each chip faster.

Claude’s Take

This is one of those Thompson essays where the framework is the whole product. The answer-inference vs agentic-inference split is genuinely useful — it explains why Cerebras and Groq aren’t really competing with Nvidia (different markets), why China’s chip situation might be less dire than it looks for the biggest future segment, and why Nvidia’s been quietly shipping CPU and memory racks. The pieces snap together.

The weakest move is the confident claim that agentic inference will be “the largest market by far.” It’s directionally plausible — uncapped by humans means uncapped by anything but power — but Thompson glides past how much of today’s “agent” demand is actually agents and how much is fancy chatbots with retrieval. The economics of an agent doing work without supervision also depend on someone wanting that work done badly enough to pay for cycles. That’s not infinite the way he implies. There’s a Jevons paradox argument that makes it work, but he doesn’t make it.

The China and space asides are throwaways — interesting, but more like LinkedIn-bait than load-bearing. The core idea, that inference is unbundling into specialized stacks and Nvidia’s flexibility premium will erode at the agentic end, is the takeaway worth carrying. 8/10 — high information density, clean conceptual handles, a few too-confident extrapolations that you should mentally discount.

Further Reading

  • Ben Thompson, Agents over Bubbles — Stratechery, referenced as the prior piece arguing for the three LLM inflection points (ChatGPT → o1 → Claude 4.5/Code).
  • Anthropic blog post on the SpaceX Colossus 1 deal — 300MW, 220k Nvidia GPUs.
  • Nvidia’s Dynamo inference framework documentation — for how Nvidia is responding architecturally.
  • Cerebras WSE-3 whitepaper, for the wafer-scale-engine technical details (44GB SRAM, 21 PB/s bandwidth).
  • Groq’s LPU architecture — the other “answer inference” specialist, deterministic latency story.