Horace He: Building Machine Learning Systems for a Trillion Trillion Floating Point Operations

ELI5 / TLDR

Horace He works on PyTorch’s compiler team at Meta and wrote the blog post Making Deep Learning Go Brrrr; he is one of the people behind torch.compile and flex_attention. His thesis: training a frontier model now consumes about 10^26 floating-point operations, but the surprising bottleneck isn’t doing the math. It’s shoving numbers between GPU memory and the compute units. He walks through why that is, why fancy compilers keep failing in machine learning, and why the real game today is designing programming models (small, predictable APIs) rather than smarter optimizers.

The Full Story

A trillion trillion is a real number now

He opens with the absurdity. Frontier models are trained with roughly 10^26 floating-point operations. He calls that “a trillion trillion” floating-point operations (strictly 10^24, so 10^26 is a hundred trillion trillion) and lets the number sit there. Nuclear power plants get bought to feed this. Startups raise a billion dollars not to finish something but to start, and most of the billion immediately becomes Nvidia’s revenue. The whole industry is fundraising to do gigantic multiplications, over and over, for months.

Underneath all of it, the actual code is embarrassingly simple. Andrej Karpathy implemented Llama 2 inference in 973 lines of plain C. No dependencies. Every loop, every matrix multiply, hand-written. It runs. It’s slow but it works.

“machine learning logic is exceedingly simple. There’s like this cool project from someone called like Andrej Karpathy called Llama 2.c… in about 973 lines of C with like no other dependencies.”

So we are pouring an unprecedented amount of money and electricity into running an extremely small program at extremely high efficiency. That asymmetry sets up everything else.

What “performance” even means here

Quick bridge for non-GPU readers. A “FLOP” is one floating-point operation: adding two decimal numbers, multiplying two decimals, that kind of thing. Modern GPUs are rated by how many of these they can do per second; an H100 peaks at somewhere around 10^15 per second. A “matmul” is a matrix multiply: take two grids of numbers, produce a third grid. It’s the only thing neural nets really do at scale.

In normal CPU code, if you measured what fraction of your CPU’s theoretical max FLOPs you actually used, you’d probably hit under 1%. CPUs spend most of their time waiting on branches, loading from memory, doing pointer chases. Nobody cares.

In ML, the equivalent metric — “model FLOP utilization” — is routinely 50%. People think 50% is bad. They want more.

“if on a CPU you measured any of your code by this metric, you can only hit 100% if at every single time every single core of your CPU is always issuing max width SIMD instructions… on the other hand in machine learning for like large scale training we’re typically often hitting around like 50% of like the peak flops.”

Hold that contrast. ML expects every silicon transistor to be earning its keep, every cycle. That expectation is what creates the discipline he describes.
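To make the stakes of that contrast concrete, here is a back-of-the-envelope sketch of what utilization does to wall-clock time. The 10^26 budget and the roughly 10^15 FLOP/s peak come from the text above; the cluster size of 16,000 GPUs is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope model FLOP utilization (MFU) arithmetic.
# Hardware numbers are order-of-magnitude assumptions, not measurements.

TOTAL_FLOPS = 1e26          # training budget quoted in the talk
PEAK_FLOPS_PER_GPU = 1e15   # roughly one H100, order of magnitude only
NUM_GPUS = 16_000           # hypothetical cluster size (assumption)
SECONDS_PER_DAY = 86_400


def mfu(achieved_flops_per_sec: float, peak_flops_per_sec: float) -> float:
    """Fraction of the hardware's peak throughput actually used."""
    return achieved_flops_per_sec / peak_flops_per_sec


def training_days(total_flops: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Wall-clock days needed to spend `total_flops` at a given utilization."""
    effective_rate = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_rate / SECONDS_PER_DAY


days_at_50 = training_days(TOTAL_FLOPS, NUM_GPUS, PEAK_FLOPS_PER_GPU, 0.50)
days_at_1 = training_days(TOTAL_FLOPS, NUM_GPUS, PEAK_FLOPS_PER_GPU, 0.01)
print(f"50% MFU: {days_at_50:.0f} days; 1% MFU: {days_at_1:.0f} days")
```

Under these assumptions the 50% run takes months while the 1% run takes decades, which is why nobody in ML shrugs at utilization the way CPU programmers do.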

How frameworks evolved (and why Python won)

Quick history. Around 2013 the dominant ML framework was Caffe, where you defined a network by editing a Protobuf file. Then TensorFlow 1, which let you build a graph in a Python DSL but still executed it as one opaque blob: you could not just print a tensor and see what was inside.

PyTorch (~2017) won by being boring. You write Python, each line runs immediately on the GPU, you print things, you debug with pdb. He calls this “eager execution.” It looks slow on paper — every Python line dispatches one tiny GPU operation — but it wasn’t slow, for a precise reason:

GPUs run asynchronously. Python isn’t actually executing the matmul. It’s putting matmul-shaped tickets onto a work queue. The GPU pulls tickets and grinds. As long as Python can put tickets down faster than the GPU consumes them, the GPU never waits, and Python’s overhead is invisible.

He uses a Wallace and Gromit gif as the mental model. Gromit (Python) is hammering down train tracks one second ahead of the engine (the GPU). Train never stops. Python looks free.

“as long as like Gromit is able to put down the train tracks faster than the train actually rolls along the train tracks, you can actually kind of view Python as like having zero overhead.”

This is the sentence to remember. CPUs and GPUs in ML are not equal partners; the CPU is a janitor scheduling work, and the GPU is the entire rest of the company.
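The train-track picture can be sketched as a tiny simulation. This is a toy model, not how PyTorch or CUDA actually schedule work: it assumes a single in-order queue, a fixed CPU dispatch cost per operation, and a fixed GPU run time per operation. The function name and the microsecond figures are made up for illustration.

```python
def gpu_idle_fraction(cpu_dispatch_us: float, gpu_kernel_us: float,
                      num_ops: int) -> float:
    """Simulate a CPU enqueueing kernels onto an in-order GPU work queue.

    The CPU needs `cpu_dispatch_us` to launch each kernel and the GPU needs
    `gpu_kernel_us` to run it. Returns the fraction of time the GPU sat idle.
    """
    cpu_clock = 0.0   # time at which the CPU has finished enqueueing this op
    gpu_clock = 0.0   # time at which the GPU has finished its latest op
    busy = 0.0
    for _ in range(num_ops):
        cpu_clock += cpu_dispatch_us
        # The GPU cannot start an op before its ticket exists,
        # nor before the previous op has finished.
        start = max(cpu_clock, gpu_clock)
        gpu_clock = start + gpu_kernel_us
        busy += gpu_kernel_us
    return 1.0 - busy / gpu_clock


# Big kernels: Gromit stays ahead of the train, Python overhead is hidden.
idle_big_kernels = gpu_idle_fraction(cpu_dispatch_us=10, gpu_kernel_us=100, num_ops=1000)
# Tiny kernels: the train catches up and waits for track.
idle_tiny_kernels = gpu_idle_fraction(cpu_dispatch_us=10, gpu_kernel_us=2, num_ops=1000)
print(f"idle with big kernels:  {idle_big_kernels:.2%}")
print(f"idle with tiny kernels: {idle_tiny_kernels:.2%}")
```

The second case is the "overhead" bucket described later: once each kernel is shorter than the time it takes to dispatch it, the GPU spends most of its time parked.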

Then Tensor Cores broke the model

In 2017 Nvidia shipped Tensor Cores: dedicated units on the GPU that do only matrix multiply, but do it about 10x faster than the rest of the chip. Suddenly the chart of “how fast is matmul vs how fast is everything else” forks by a factor of ten. Anything that isn’t a matmul (adding two tensors, applying a non-linearity, normalising) can reach maybe 7% of the chip’s peak throughput. So if you can’t keep your work on the Tensor Cores, you’ve thrown away more than 90% of your hardware.

This is what made ML compilers necessary. Eager mode runs each line one at a time. That worked when 90% of time was matmul. It does not work when 30%+ of time is non-matmul scaffolding that all has to be squeezed.

The three things a GPU does

This is the load-bearing mental model of the talk. He breaks GPU time into three buckets.

Compute. Actually doing math. On modern GPUs, “compute” means matmul — anything else is roughly free in flops but not in time, for the next reason.

Memory. Moving tensors between two different memories on the GPU. The GPU has a big slow pool (HBM/VRAM, where your tensors live) and a small fast pool (SRAM, right next to the compute units). You can only do math on data that’s in the fast pool. He uses the factory/warehouse analogy: warehouse is huge, factory is tiny, and a truck has to drive between them every time you want to operate on something.

Overhead. The GPU sitting idle because the CPU hasn’t dispatched the next thing yet. Gromit fell behind. The train is parked.

The shocker is the relative weights. In a paper analysing a BERT-style model: matmuls are 99.8% of the FLOPs but only 61% of the runtime. The remaining 0.2% of the work — the simple element-wise ops, the normalisations — eats almost 40% of the wall clock.

“matrix multiplications are responsible for like 99.8% of your flops, but they’re only responsible for 61% of your runtime.”

Where does the 40% go? Into trucks. Each cheap operation has to load data from VRAM into SRAM, run, write the answer back to VRAM, then the next op has to load that same data back into SRAM again. The arithmetic is trivial. The shipping is not.

This is memory-bandwidth-bound computing, and it’s the single most important fact in the talk. The GPU is not a math machine that occasionally waits on memory — it’s a memory-shuffling machine that occasionally does math.
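One common way to reason about this is a roofline-style estimate: a kernel takes at least as long as the slower of its math and its memory traffic. The sketch below applies that to a square matmul and an element-wise add. The peak FLOP rate and memory bandwidth are round-number assumptions, not the specification of any particular GPU, and the model is optimistic because it lets the add use the full matmul peak.

```python
# Roofline-style lower bound: time >= max(math time, memory time).
PEAK_FLOPS = 1e15        # assumed peak, FLOP/s (order of magnitude)
MEM_BANDWIDTH = 3e12     # assumed VRAM bandwidth, bytes/s (assumption)
BYTES_PER_ELEM = 2       # 16-bit floats


def kernel_time(flops: float, bytes_moved: float):
    """Return (estimated seconds, which resource is the bottleneck)."""
    compute_time = flops / PEAK_FLOPS
    memory_time = bytes_moved / MEM_BANDWIDTH
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


n = 8192
tensor_bytes = n * n * BYTES_PER_ELEM

# Square matmul: 2*n^3 FLOPs, reads two matrices, writes one.
matmul_time, matmul_bound = kernel_time(2 * n**3, 3 * tensor_bytes)
# Element-wise add: n^2 FLOPs, but the same three tensors cross the bus.
add_time, add_bound = kernel_time(n * n, 3 * tensor_bytes)

print(f"matmul: {matmul_time * 1e6:.0f} us, {matmul_bound}-bound")
print(f"add:    {add_time * 1e6:.0f} us, {add_bound}-bound")
print(f"add does {n * n / (2 * n**3):.6%} of the matmul's FLOPs "
      f"but takes {add_time / matmul_time:.1%} of its time")
```

The add performs a vanishing share of the arithmetic yet costs a noticeable share of the time, which is the same shape as the 99.8% versus 61% split quoted above.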

Operator fusion: the most important compiler trick

The fix sounds obvious once stated. Instead of running add, mul, cos as three round-trips, do all three on the SRAM-resident copy before writing back. One truck trip instead of three. He calls it “the most important optimization in a deep learning compiler by far.”

“operator fusion. And so what an operator fusion does is that instead of like, you know, sending the data back and forth uh so much, uh we like do a single GPU kernel uh where you send the data once uh to the factory units, you do all of the operations, uh and then you send the data back.”

Notice why eager mode can’t do this — eager mode commits to each operation before seeing the next one. You need a compiler to look ahead, see the chain of three ops, and emit a single fused kernel. That’s most of what torch.compile does.
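The truck-trip arithmetic can be shown with a small counting exercise. This is plain Python standing in for GPU kernels, not what torch.compile emits: the CountingMemory class is a hypothetical stand-in for VRAM, and the add, mul, cos chain uses made-up constants.

```python
import math


class CountingMemory:
    """Hypothetical stand-in for VRAM that counts elements read and written."""

    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, buf):
        self.reads += len(buf)
        return list(buf)

    def store(self, values):
        self.writes += len(values)
        return list(values)


def unfused(x, mem):
    """Three separate 'kernels': every op round-trips through slow memory."""
    a = mem.store([v + 1.0 for v in mem.load(x)])
    b = mem.store([v * 2.0 for v in mem.load(a)])
    return mem.store([math.cos(v) for v in mem.load(b)])


def fused(x, mem):
    """One 'kernel': load once, do all three ops, store once."""
    return mem.store([math.cos((v + 1.0) * 2.0) for v in mem.load(x)])


x = [i / 1000.0 for i in range(1000)]
mem_unfused, mem_fused = CountingMemory(), CountingMemory()
out_unfused = unfused(x, mem_unfused)
out_fused = fused(x, mem_fused)

traffic_unfused = mem_unfused.reads + mem_unfused.writes
traffic_fused = mem_fused.reads + mem_fused.writes
print(f"unfused traffic: {traffic_unfused} elements")   # 6000
print(f"fused traffic:   {traffic_fused} elements")     # 2000
```

Same arithmetic, same result, one third of the memory traffic. For a memory-bound chain of element-wise ops, that ratio is roughly the speedup available.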

There’s a related trick: recomputation vs reuse. Sometimes it’s faster to throw away an intermediate value and recompute it later than to ship it back to VRAM and reload it. ML programs make this unusually important because of backpropagation. Forward pass goes layer 1 → 2 → 3 → 4. Backward pass goes 4 → 3 → 2 → 1, and it needs every intermediate from the forward pass. Those intermediates have to be saved somewhere across the entire forward run — they’re “long-lived” in a way most normal program intermediates aren’t. This unusual lifetime pattern is why memory pressure dominates ML training and why most “out of memory” errors happen.
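The recompute-instead-of-store idea can be sketched on a toy chain of scalar layers. This is an illustration of the general technique usually called activation checkpointing, written under simplifying assumptions: each layer is tanh(w * x), only the gradient with respect to the input is computed, and "memory" is just a count of saved values. All names are hypothetical.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    """Derivative of layer(w, x) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_save_all(weights, x):
    """Keep every layer input alive until the backward pass."""
    saved = []
    for w in weights:
        saved.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(saved)):
        grad *= layer_grad(w, x_in)
    return grad, len(saved)


def grad_checkpointed(weights, x, segment):
    """Keep only one input per segment; recompute the rest during backward."""
    checkpoints = []
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = layer(w, x)
    grad, peak = 1.0, 0
    for seg in reversed(range(len(checkpoints))):
        seg_weights = weights[seg * segment:(seg + 1) * segment]
        x_in, inputs = checkpoints[seg], []
        for w in seg_weights:            # recompute this segment's forward pass
            inputs.append(x_in)
            x_in = layer(w, x_in)
        # Upper bound on values alive at once: all checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, xi in zip(reversed(seg_weights), reversed(inputs)):
            grad *= layer_grad(w, xi)
    return grad, peak


weights = [0.5 + 0.05 * i for i in range(16)]
grad_all, saved_all = grad_save_all(weights, 0.3)
grad_ckpt, peak_ckpt = grad_checkpointed(weights, 0.3, segment=4)
print(f"save everything: {saved_all} values held")
print(f"checkpointed:    {peak_ckpt} values held at peak")
```

Both paths produce the same gradient; the checkpointed one holds fewer values at the cost of running each segment's forward pass a second time.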

A brief detour: NaNs run faster than real numbers

The funniest section. He asks: do the contents of a matrix affect how fast a matmul runs? You’d assume no — same memory pattern, same instructions, no branches.

But yes. Multiply matrices full of zeros and you measure higher throughput than multiplying matrices full of real numbers. The reason is dynamic switching power. Every time a transistor flips from 0 to 1 or back, it dissipates a tiny bit of energy. Multiply by zeros and most transistors don’t flip; the chip stays cool; it doesn’t throttle. Multiply by random reals and the chip heats up and throttles down.

A researcher once saw their training run start producing NaNs and noticed that throughput went up. NaNs flip even fewer transistors than zeros. Their model was broken but flying.

“they like at some point uh their model would NaN, and then they’d be like, ‘Wow, my performance just got way better.’”

This is the kind of detail that makes “abstraction” look thinner than you think.

Why compilers keep failing in ML, and what to do instead

This is the back half of the talk. Horace works on a compiler team and says compilers, by themselves, are not enough. He sets up the problem with a fake library called HEL — Horace’s Exciting Library:

Doesn’t always work.
No documentation about when it works (read the source).
Every version may change what works without warning.

Would you use it? Of course not. But that, he argues, is exactly the user experience of compiler optimisations. They fire on some patterns and not others, the rules aren’t documented, and the next compiler release silently moves the goalposts. He quotes the ISPC compiler designer:

“as long as vectorization can fail and it will then if you’re programmer that actually cares about what code the compiler generates for your program you need to deeply understand the compiler… and so this is like a very horrible way to program.”

His prescription, sharpened: a compiler optimisation that always works is just part of the language. SIMD intrinsics work because they’re guaranteed to map to SIMD instructions — that’s not an optimisation, it’s a contract. Auto-vectorisation isn’t a contract, it’s a hope.

The flash attention story

He uses attention to make the point concrete. Attention is the operation at the heart of every transformer — for each token, look at every other token and weight them. A naive PyTorch implementation is matmul → softmax → matmul. Three operations, three round trips to VRAM.

In 2022 Tri Dao’s FlashAttention paper showed how to fuse all three into a single kernel that never materialises the giant intermediate matrix. Massive memory savings, big speedup. Question: how do you give people FlashAttention?
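The reason the intermediate never has to exist in full is that softmax can be computed in a streaming fashion, keeping a running maximum and a running sum and rescaling as new blocks arrive. The sketch below shows that idea for a single query vector in plain Python. It is a minimal illustration of the streaming-softmax trick, not the FlashAttention kernel; the function names, block size, and random test data are assumptions.

```python
import math
import random


def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Materialise the full row of scores, then softmax, then weight values."""
    scores = [score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def streaming_attention(q, keys, values, block=2):
    """Process keys/values block by block; never hold more than one block of scores."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        scores = [score(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what we accumulated under the old maximum (0.0 on first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(4)]
keys = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
values = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]
out_naive = naive_attention(q, keys, values)
out_stream = streaming_attention(q, keys, values, block=2)
print(max(abs(a - b) for a, b in zip(out_naive, out_stream)))
```

The two results agree to floating-point rounding, while the streaming version only ever needs one block of scores in fast memory at a time.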

Option A — pattern matching. Have a compiler spot the matmul → softmax → matmul shape and rewrite it to FlashAttention. Problem: any user who writes softmax slightly differently silently falls off the fast path and their code is 3x slower with no warning. HEL again.

Option B: a single monolithic op. Ship flash_attention(q, k, v) as one PyTorch call. Works, but every new attention variant (sliding window, ALiBi, paged attention, prefix LM) needs the kernel rewritten by hand. He shows a screenshot of the FlashAttention repo’s signature, which has accumulated dropout_p, softmax_scale, causal, window_size, softcap, alibi_slopes, and growing. People are constantly asking for their variant.

Option C — a programming model. This is FlexAttention, his work. The trick: every attention variant is the same kernel plus a small mathematical modification (a mask, a bias). Let the user write the modification in plain Python, lift it into the kernel mechanically. The compiler isn’t deciding whether to fuse — fusion is guaranteed by construction. If your code type-checks against the FlexAttention API, you get a fused kernel. Always. Predictably.

“a single monolithic operator is not actually always sufficient… but you know this is kind of where you kind of can be clever and come up with like a new program model that wasn’t either of the program models that users had before.”

This is the philosophical core of the talk. Don’t try to make the compiler smarter. Design a small API where the optimisation is part of the contract, then hide the ugly kernel behind it. Users get to write weird new attention variants — molecular graphs, prefix LM, whatever — and the speed comes free.
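The shape of that contract can be sketched in plain Python. This is a toy, not the FlexAttention API: the real one operates on tensors and its callbacks also receive batch and head indices. Here toy_flex_attention, causal, and sliding_window are hypothetical names, and the "kernel" is an ordinary loop. What carries over is the structure: one fixed attention routine, plus a small user-written function that edits each score.

```python
import math


def toy_flex_attention(q, k, v, score_mod=None):
    """One fixed attention loop; variants plug in via `score_mod`.

    q, k, v are lists of vectors. `score_mod(score, q_idx, kv_idx)` returns
    the modified score; returning -inf masks that position out.
    """
    out = []
    for qi, q_vec in enumerate(q):
        scores = []
        for ki, k_vec in enumerate(k):
            s = sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(len(q_vec))
            if score_mod is not None:
                s = score_mod(s, qi, ki)
            scores.append(s)
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vec[d] for w, vec in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def causal(score, q_idx, kv_idx):
    """Each token may only look at itself and earlier tokens."""
    return score if kv_idx <= q_idx else float("-inf")


def sliding_window(width):
    """Each token may only look at neighbours within `width` positions."""
    def mod(score, q_idx, kv_idx):
        return score if abs(q_idx - kv_idx) <= width else float("-inf")
    return mod


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

full = toy_flex_attention(q, k, v)
masked = toy_flex_attention(q, k, v, score_mod=causal)
local = toy_flex_attention(q, k, v, score_mod=sliding_window(0))
print(masked[0])   # first token can only see itself, so this is v[0]
```

A new variant is a new three-line function rather than a new kernel, and nothing about the surrounding loop depends on which variant was passed in.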

He invokes Grothendieck’s walnut analogy: you can crack a math problem by hammering it open, or you can soak it in water until it opens itself. Programming models are the soaking.

Distributed training: the scale layer

The last act zooms out from one GPU to 131,000 of them. The vocabulary:

Data parallelism — every GPU gets the same model, different data. After each step, gradients have to be summed across every GPU. There’s a math constraint that prevents you from just throwing more GPUs at this — past some batch size, models stop learning.
Tensor parallelism — split a single matmul across multiple GPUs. Hard to overlap communication with computation because you can’t start the math until the data shows up.
Pipeline parallelism — assign layer 1 to GPU 1, layer 2 to GPU 2, like an assembly line. Backprop adds wrinkles because the line has to flow forward then backward.

Llama 3 uses all three, plus a fourth (context parallelism), simultaneously. The diagram is a four-dimensional packing problem.
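The gradient-summing step in data parallelism can be illustrated with a one-parameter model. This is a minimal sketch under stated assumptions: a linear model y = w * x with mean-squared-error loss, equal-sized shards, and an all_reduce_mean function that is a plain-Python stand-in for the collective a real framework would run over the network. The data values are made up.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up holding the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 1.5
num_workers = 4
shard = len(xs) // num_workers

# Each worker holds the same weight but sees only its own slice of the batch.
local_grads = [
    grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]
synced_grads = all_reduce_mean(local_grads)

# Reference: one worker computing the gradient over the whole batch.
full_batch_grad = grad_mse(w, xs, ys)
print(local_grads)
print(synced_grads[0], full_batch_grad)
```

After the reduction every worker holds the full-batch gradient, so adding workers is mathematically the same as growing the batch. That is also why the batch-size ceiling mentioned above caps how far this axis alone can scale.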

The unsettling closing fact: at 16,000 GPUs, the mean time between hardware failures is 1.8 hours. At 131,000 GPUs it’s about 15 minutes. Frontier training is a race to take a single optimization step before something somewhere catches fire.

“now you have a situation where you might not be able to make even a single step uh before a single GPU in your entire fleet uh fails.”
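The two failure figures are roughly consistent with the simplest possible model, in which each GPU fails independently at a constant rate, so the cluster's mean time between failures shrinks in proportion to the GPU count. That independence assumption is mine, not something the talk states, and the function name is hypothetical.

```python
def cluster_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Mean time between failures for the whole cluster, assuming
    independent failures at a constant per-GPU rate."""
    return per_gpu_mtbf_hours / num_gpus


# Work backwards from the quoted figure: 1.8 hours at 16,000 GPUs.
per_gpu_mtbf = 1.8 * 16_000                      # implied hours per GPU
mtbf_131k_minutes = cluster_mtbf_hours(per_gpu_mtbf, 131_000) * 60

print(f"implied per-GPU MTBF: {per_gpu_mtbf:,.0f} hours "
      f"(about {per_gpu_mtbf / 8760:.1f} years)")
print(f"cluster MTBF at 131,000 GPUs: {mtbf_131k_minutes:.1f} minutes")
```

Each individual GPU is reliable on a scale of years; the fleet as a whole fails on a scale of minutes, in line with the "about 15 minutes" quoted above.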

He closes on what he thinks the interesting question is. Not “how do we build smarter compilers.” Rather: what programming models will let people express the next 100,000-GPU training run?

Key Takeaways

ML compute is dominated by matrix multiplications, but ML runtime is dominated by memory bandwidth — moving tensors between VRAM and SRAM. Your GPU is a shipping company that occasionally does math.
“Model FLOP utilization” of 50% is normal in ML training. The same metric on normal CPU code would be under 1%. The gap is what creates an entire industry of kernel-level engineering.
Tensor Cores (2017) made matmul ~10x faster than everything else. That single hardware change is why ML compilers exist — eager mode was fine until non-matmul work suddenly became the bottleneck.
Llama 2 inference is 973 lines of dependency-free C. The complexity isn’t the model, it’s the optimisation.
Backpropagation gives ML programs an unusual lifetime structure: forward-pass intermediates have to live until the corresponding backward-pass step. This is why “out of memory” is the dominant failure mode and why recomputation-vs-reuse trade-offs matter so much.
Multiplying zero-filled or NaN-filled tensors is measurably faster than multiplying real ones because fewer transistors flip, so less heat, so less throttling. Your benchmarks lie if your data isn’t realistic.
Operator fusion is the single highest-leverage compiler trick: do all the work on a tile while it’s in fast memory before writing back, instead of round-tripping per operation.
Compilers that “sometimes optimise” are user-hostile. The real win is programming models where the optimisation is part of the API contract: SIMD intrinsics over auto-vectorisation, FlexAttention over pattern-matching softmax.
At 131,000 GPUs, hardware failures arrive every ~15 minutes. The training loop is now a fault-tolerance problem, not a math problem.

Claude’s Take

This is one of the best teaching talks on ML systems I’ve come across. Horace has a rare combination — he is deep enough in PyTorch internals to have written half of them, and clear enough as a writer to make a finance audience nod along. The Wallace and Gromit metaphor for async dispatch, the warehouse-and-factory framing for memory bandwidth, the HEL parody of unreliable compilers — these are durable mental models, not throwaway jokes.

The argument that “a compiler optimisation that always works is just part of the language” is the kind of sentence that re-organises how you think about an entire field. It applies far beyond ML — half of what people complain about in modern web tooling is “the bundler sometimes does X” — but it’s most violent in ML, where being 2x slow means being uncompetitive.

The talk’s only real weakness is the distributed-training section, which is too compressed to do justice to a topic that deserves its own hour. He gestures at data / tensor / pipeline parallelism but doesn’t explain why each is hard with the same care he gave the single-GPU memory story. Score docked half a point for that. Otherwise this is essentially required viewing if you want to understand why so much of the AI economy is, mechanically, a bandwidth problem dressed up as a compute problem.

9/10. The score is high because the talk gives you reusable abstractions, not just facts.