Chip design from the bottom up – Reiner Pope

ELI5/TLDR

Reiner Pope (CEO of MatX) takes Dwarkesh through how an AI chip is actually built, starting from the smallest possible Lego brick — a single logic gate — and working all the way up to a TPU. The big lesson is that almost everything interesting in chip design is a fight between doing the math and moving the numbers around. The math itself is cheap. Shuffling data from storage to the calculator and back is what eats almost all the area, energy, and time. Every clever trick in modern AI hardware — low-precision arithmetic, systolic arrays, scratchpad memories, tensor cores — exists to win a little more compute per unit of communication.

The Full Story

Building a multiplier out of gates

Start with the smallest thing on a chip: a single AND gate. It takes two bits in and outputs 1 only if both are 1. Now imagine you want to multiply two numbers, each four bits long.

The trick is the same long multiplication you learned in school, but in binary. You take each bit of one number and AND it with each bit of the other. That gives you a grid of partial products — sixteen ANDs in total for a 4-bit-by-4-bit multiply. Then you sum the columns.

The summing is where most of the work happens. The tool for that job is a full adder. Forget what the name sounds like — it doesn’t add 32-bit numbers. It adds three single bits and outputs two bits (the sum, and the carry). Think of it as a small device that counts how many 1s came in and writes the answer in binary. Three 1s in? Output is “11” (binary for 3). Two 1s? “10”. One? “01”.
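A minimal Python sketch of that bit counter, using the usual XOR/AND/OR gate equations for a full adder. The function name and the (carry, sum) return order are illustrative choices, not something from the interview.

```python
def full_adder(a, b, c):
    """Add three single bits; return (carry, sum) so the pair reads as a 2-bit count."""
    sum_bit = a ^ b ^ c                       # 1 when an odd number of inputs are 1
    carry_bit = (a & b) | (a & c) | (b & c)   # 1 when at least two inputs are 1
    return carry_bit, sum_bit

# Walk all eight input combinations and check the output equals the count of 1s.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            carry, s = full_adder(a, b, c)
            assert 2 * carry + s == a + b + c
```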

You stack these full adders, each eating three bits from a column and writing two back out (the sum bit stays in that column, the carry moves to the next column up), until eventually you’ve collapsed the whole grid down to one number. That’s the answer. The technique has a name, a Dadda multiplier, and it’s the standard way to build area-efficient multipliers on a chip.
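The column-collapsing idea can be sketched in Python as below. This is a simplified column-by-column reduction, not the exact staged schedule a real Dadda multiplier uses, so the adder count it reports is specific to this sketch. The names `multiply`, `columns`, and the defaults `p=4, q=4` are illustrative assumptions.

```python
def full_adder(a, b, c):
    """Three bits in, two bits out: (carry, sum)."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c

def multiply(x, y, p=4, q=4):
    """Multiply a p-bit x by a q-bit y using only ANDs and full adders.

    Returns (product, and_gates, full_adders).
    """
    # columns[k] collects every bit whose weight is 2**k.
    columns = [[] for _ in range(p + q)]
    and_gates = 0
    for i in range(p):
        for j in range(q):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
            and_gates += 1

    full_adders = 0
    for k in range(p + q):
        column = columns[k]
        while len(column) > 1:
            a = column.pop()
            b = column.pop()
            c = column.pop() if column else 0   # pad with 0 when only two bits remain
            carry, sum_bit = full_adder(a, b, c)
            full_adders += 1
            column.append(sum_bit)              # the sum stays in this column
            if k + 1 < p + q:
                columns[k + 1].append(carry)    # the carry moves one column up
    product = sum(col[0] << k for k, col in enumerate(columns) if col)
    return product, and_gates, full_adders

product, and_gates, full_adders = multiply(13, 11)
print(product, and_gates, full_adders)
```

Columns are processed from least to most significant, so every carry lands in its column before that column is reduced.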

“This is the standard for how you do area-efficient multipliers using full adders.”

A 4-bit-by-4-bit multiply-accumulate ends up costing 16 ANDs plus 16 full adders. In general, a p-bit-by-q-bit multiplier costs about p × q ANDs and a similar number of full adders. That is quadratic in the bit width. This single fact does more work than almost anything else in modern AI hardware design.

Why low precision matters so much

If doubling the bit width quadruples the area, then halving the bit width should cut the area to a quarter. Going from FP8 to FP4 should, in principle, let you fit about 4x as many multipliers in the same silicon. Nvidia’s older chips advertised a 2x speedup for FP4 over FP8, a kind of rounding down. The newer B300 generation has started owning the math properly and quotes 3x. Reiner notes the true answer is closer to 4x.
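The arithmetic behind that 4x can be written out in a few lines. This sketch assumes multiplier area grows as the square of the bit width and treats FP8 and FP4 as plain 8-bit and 4-bit multipliers, ignoring how each format splits its bits between exponent and mantissa. Both function names are hypothetical.

```python
def relative_multiplier_area(bits):
    """Area of a bits-by-bits multiplier in arbitrary units, assuming cost ~ bits**2."""
    return bits * bits

def multipliers_per_area(old_bits, new_bits):
    """How many new_bits multipliers fit in the area of one old_bits multiplier."""
    return relative_multiplier_area(old_bits) / relative_multiplier_area(new_bits)

print(multipliers_per_area(8, 4))   # prints 4.0
```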

This is why every AI chip company is in a race to the bottom on precision. Lower precision isn’t just “a little cheaper.” It’s quadratically cheaper. The whole reason neural nets run on FP8 and FP4 today is that the silicon math forces the issue.

The data movement tax

Now zoom out a level. The multiplier sits inside a CUDA core (or a CPU’s ALU). The core has a small register file — eight slots, say — and to do one multiply-accumulate it needs to grab three values from those slots, run the math, and write the answer back.

The grabbing is done by something called a mux — short for multiplexer, but really just a fancy switch. To select one of eight inputs, the mux ANDs every single input with a mask of 1s and 0s, then ORs them all together. Software-wise it looks like “give me register 3.” Hardware-wise it’s a small forest of gates.

Count the cost. For three inputs, with eight registers each p bits wide, you’re spending 3 × 8 × p = 24p AND gates just to route data, plus nearly as many ORs to merge the results. The actual multiplier only needs about 4p gates. The shipping costs more than the product.
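Here is a small Python sketch of the mask-then-OR mux and its gate count. The count uses n × p ANDs plus (n − 1) × p ORs for an n-way, p-bit mux; the register contents and the function names are made up for illustration.

```python
def mux(registers, select, p):
    """Pick registers[select] the way the hardware does: mask everything, then OR."""
    all_ones = (1 << p) - 1
    out = 0
    for index, value in enumerate(registers):
        mask = all_ones if index == select else 0   # one-hot select, widened to p bits
        out |= value & mask                         # AND with the mask, OR into the result
    return out

def mux_gates(n, p):
    """Gate count for an n-way, p-bit mux: (AND gates, OR gates)."""
    return n * p, (n - 1) * p

registers = [3, 14, 7, 9, 0, 5, 12, 1]   # eight 4-bit registers (made-up values)
assert mux(registers, 3, p=4) == 9       # "give me register 3"

ands, ors = mux_gates(n=8, p=4)
print(3 * ands, 3 * ors)                 # three operand muxes for one multiply-accumulate
```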

“Almost all your cost becomes synchronization or communication cost compared to the actual logic.”

This was the state of GPUs before the Volta generation. Most of the silicon was doing nothing interesting. Tensor Cores changed that.

Systolic arrays: hardening more of the loop

The idea behind a tensor core (or, more generically, a systolic array) is to stop hardening just one multiply-accumulate at a time. Bake in two whole loops of a matrix multiply instead. A whole grid of multipliers, wired together so the data flows through them like water through pipes.

The win comes from a property of matrix multiplication: the matrix you’re multiplying by — the weights, in AI terms — stays fixed for a long time. So you load it into the systolic array slowly, once, and then stream vectors of activations through it. The bandwidth coming from the expensive register file only has to scale with x (the side length of the array) instead of x times y (the total number of multipliers).

“We have x times y as much compute as we had before. But we want to aim for having only x times as much communication.”

Older TPUs ran 128-by-128 systolic arrays. That’s 16,384 multiply-accumulate units, all wired together, all running on every clock cycle. This turns out to be the most area-efficient way anyone knows to build a matrix multiplier.
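To make the compute-versus-communication accounting concrete, here is a rough Python model of a weight-stationary array. It is not cycle-accurate and does not model how data physically ripples between cells. It only counts multiply-accumulates against activation words streamed in, ignoring the one-time weight load and the outputs drained per vector. All names are hypothetical.

```python
def weight_stationary_matmul(weights, activations):
    """Multiply each activation vector by a fixed x-by-y weight grid.

    Returns (outputs, macs, words_streamed).
    """
    x, y = len(weights), len(weights[0])
    macs = 0
    words_streamed = 0
    outputs = []
    for vector in activations:
        words_streamed += x                    # x activation words enter the array
        sums = [0] * y
        for i in range(x):
            for j in range(y):                 # cell (i, j) keeps weights[i][j] in place
                sums[j] += vector[i] * weights[i][j]
                macs += 1
        outputs.append(sums)
    return outputs, macs, words_streamed

weights = [[1, 2, 3], [4, 5, 6]]               # x = 2, y = 3
outputs, macs, words = weight_stationary_matmul(weights, [[1, 1], [2, 0]])
print(outputs, macs / words)
```

In this model each streamed word feeds y multiply-accumulates, so the ratio of work to traffic grows with the size of the array.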

The clock cycle

A chip has roughly 100 billion transistors, and they all need to coordinate. In software, threads use locks and mutexes. In hardware, they use a global clock — a signal that ticks every nanosecond or so, telling every register in the entire chip to update at the same instant.

In between two ticks, signals propagate through a cloud of logic. The constraint is brutal: every computation must finish before the next tick. If you want to run the clock faster — say two gigahertz instead of one — you have to make sure no path through the logic takes longer than half a nanosecond.

So chip designers insert pipeline registers — little checkpoints partway through the logic — to break up long paths. Add more registers, run the clock faster, but pay in silicon area. Push it too far and you’ve spent all your silicon on the checkpoints rather than the work being checkpointed. Same theme: compute versus communication, throughput versus everything else.
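The relationship between path delay and clock speed can be sketched as follows. The model is idealized: it assumes logic splits perfectly evenly across stages and that the pipeline registers themselves add no delay, neither of which holds on real silicon. The function names are illustrative.

```python
def max_clock_ghz(stage_delays_ns):
    """Fastest clock the slowest register-to-register path allows (1 / ns = GHz)."""
    return 1.0 / max(stage_delays_ns)

def pipeline(total_delay_ns, stages):
    """Split one block of logic into equal stages by adding (stages - 1) registers."""
    return [total_delay_ns / stages] * stages

assert max_clock_ghz(pipeline(1.0, 1)) == 1.0   # 1 ns of logic: 1 GHz
assert max_clock_ghz(pipeline(1.0, 2)) == 2.0   # one extra register: 2 GHz
```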

There’s a sneakier case too. Some logic feeds back on itself — a running sum, for example, where each clock cycle adds a new number to the total. You can’t just stick a register in the middle of that. If you did, you’d end up with two separate running sums (one of the even-clock numbers, one of the odd). These feedback loops set the chip’s maximum clock speed.
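A toy Python simulation shows the even/odd split. It models the feedback loop as a short shift register, where `loop_registers` (a hypothetical parameter) is the number of registers sitting inside the loop.

```python
def accumulate(values, loop_registers=1):
    """Simulate a running-sum loop with the given number of registers inside it.

    Returns the register contents after the last clock cycle.
    """
    registers = [0] * loop_registers
    for value in values:
        # The adder only sees the total that entered the loop loop_registers cycles ago.
        registers = registers[1:] + [registers[0] + value]
    return registers

print(accumulate([1, 2, 3, 4], loop_registers=1))   # one true running sum
print(accumulate([1, 2, 3, 4], loop_registers=2))   # two interleaved partial sums
```

With two registers in the loop, the first, third, fifth values feed one total and the second, fourth, sixth feed another, so the circuit no longer computes the sum it was meant to.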

Why FPGAs are 10x slower than ASICs

An FPGA does the same thing as an ASIC — gates wired together by clock cycles — but in a programmable way. Instead of fixed AND, OR, XOR gates baked into silicon, an FPGA has lookup tables. A lookup table takes four input bits, looks them up in a small stored truth table, and emits one bit. You can program the truth table to behave like any gate you want.

The cost of all that flexibility: a single lookup table needs 32 gates to do the work of what would be three gates in an ASIC. That roughly tenfold gap in gate count is where the famous 10x penalty comes from.
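A 4-input lookup table is easy to model in Python: programming it means filling in 16 stored bits, and evaluating it means indexing into them. The function names here are hypothetical, and the sketch says nothing about how a real FPGA stores or routes the table.

```python
def program_lut(gate):
    """Build the 16-entry truth table that makes a 4-input LUT behave like gate."""
    table = []
    for index in range(16):
        bits = [(index >> position) & 1 for position in range(4)]
        table.append(gate(*bits) & 1)
    return table

def lut_read(table, a, b, c, d):
    """Evaluate the LUT: the four input bits form an index into the stored table."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]

# The same structure becomes a 2-input AND (c and d ignored) or a 4-input XOR.
and_table = program_lut(lambda a, b, c, d: a & b)
xor_table = program_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
assert lut_read(and_table, 1, 1, 0, 0) == 1
assert lut_read(xor_table, 1, 1, 1, 0) == 1
```

The AND gate uses only a quarter of the distinct behavior the table can express, yet it still pays for all 16 entries and the logic to select among them.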

“There’s a more concise way to describe a truth table than listing out every single possible combination of inputs, which is just to write out the gate.”

The business case for FPGAs isn’t speed. It’s that the first ASIC costs $30 million in tape-out fees, while the first FPGA costs $10,000. If your workload changes every month — high-frequency trading, prototyping new circuits — you eat the 10x penalty to skip the tape-out.

Deterministic latency: scratchpads vs caches

Jane Street uses FPGAs because they need to know exactly when a packet will arrive and depart. A CPU can’t promise that, even though in principle nothing stops a CPU from being deterministic.

The culprit is the cache. When a CPU reads memory, hardware secretly checks whether the data is in a small fast cache near the core. If yes, you get the answer in a nanosecond. If no, you wait 100 nanoseconds while it fetches from main memory. Whether you hit or miss depends on what other programs ran recently, what’s in the cache, even random number generators inside the cache controller.

TPUs do it differently. They use a scratchpad instead. The scratchpad is the same kind of fast on-chip memory, but the software has to ask for it explicitly. One instruction reads from scratchpad, a totally different instruction reads from HBM (the off-chip memory). The hardware never makes secret decisions on the programmer’s behalf. The cost is that programming gets harder; the benefit is that latency becomes predictable.
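The contrast can be sketched with a toy model. The cache below is a tiny direct-mapped design chosen for brevity, not a description of any real CPU. The 1 ns and 100 ns figures reuse the hit and miss numbers above, and applying the same numbers to scratchpad and HBM reads is an assumption made only for illustration. All names are hypothetical.

```python
class ToyCache:
    """Direct-mapped cache: the hardware decides what stays, the caller just reads."""

    def __init__(self, lines=4, hit_ns=1, miss_ns=100):
        self.tags = [None] * lines
        self.hit_ns = hit_ns
        self.miss_ns = miss_ns

    def read_latency_ns(self, address):
        slot = address % len(self.tags)
        if self.tags[slot] == address:
            return self.hit_ns
        self.tags[slot] = address          # silently evict whatever was there
        return self.miss_ns

SCRATCHPAD_NS = 1    # assumption: as fast as a cache hit
HBM_NS = 100         # assumption: as slow as a cache miss

def explicit_read_latency_ns(instruction):
    """Scratchpad style: the instruction itself says where the data lives."""
    return {"read_scratchpad": SCRATCHPAD_NS, "read_hbm": HBM_NS}[instruction]

cache = ToyCache()
# Same address 5 three times, but its latency depends on what ran in between.
print([cache.read_latency_ns(a) for a in (5, 5, 9, 5)])
print([explicit_read_latency_ns(i) for i in ("read_scratchpad", "read_hbm")])
```

In the cache model, reading address 9 evicts address 5 without the program asking for it. In the scratchpad model the latency is a property of the instruction, so it can be known ahead of time.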

CPUs, GPUs, TPUs: the same building blocks at different scales

A CPU core is enormous — maybe one one-hundredth of the whole die per core. Most of that area isn’t doing math. It’s cache, register file, and branch predictor.

The branch predictor exists because instructions take several nanoseconds to evaluate. If the CPU waited to know whether an “if” was true before fetching the next instruction, the pipeline would sit idle for most of that time and throughput would crawl. So instead it guesses: it predicts which way the branch will go, runs ahead, and rolls back if wrong. Whole sections of CPU silicon are dedicated to making good guesses.

GPUs strip a lot of that out. No branch predictor, tighter register files, more silicon for math. CUDA cores look like simpler, denser CPUs.

TPUs go further. Instead of thousands of small math units (CUDA cores or SMs), a TPU has just a few enormous ones — big matrix units with vector units in between. Reiner’s framing is striking:

“From a very high-level point of view, the GPU has a lot of tiny TPUs tiled across the whole chip.”

The trade-off is bandwidth versus utilization. A GPU’s many small units give you many small wires between them, so any-to-any data movement is cheap. A TPU’s huge units have only two perimeter edges connecting them, so data movement is expensive. But the huge units amortize their register file overhead beautifully when the workload is one giant matrix multiply.

The brain comparison

Dwarkesh asks: the brain has unstructured sparsity, memory and compute are co-located, and it runs at a much slower clock. Could a chip designed more like a brain be vastly more energy-efficient?

Reiner’s answer is a polite no. If you took a GPU and clocked it at one megahertz instead of one gigahertz, you’d draw about 1000x less power, because most energy on a chip goes into toggling capacitors from zero to one and back. Fewer ticks, fewer toggles. But you’d also get 1000x less work done per second, so the energy per operation barely moves. It’s not a free lunch.
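The scaling argument can be checked against the standard dynamic switching power formula, P = activity × C × V² × f. The sketch below holds voltage fixed, as the argument in the text implicitly does, and the numeric values for activity factor, capacitance, and voltage are made up purely for illustration.

```python
def dynamic_power_watts(activity, capacitance_f, voltage_v, frequency_hz):
    """Standard dynamic switching power: P = activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz

def energy_per_cycle_joules(activity, capacitance_f, voltage_v):
    """Energy spent per clock tick; note that frequency does not appear."""
    return activity * capacitance_f * voltage_v ** 2

# Illustrative, made-up values for activity factor, switched capacitance, and voltage.
chip = dict(activity=0.1, capacitance_f=1e-7, voltage_v=0.8)
fast = dynamic_power_watts(frequency_hz=1e9, **chip)   # 1 GHz
slow = dynamic_power_watts(frequency_hz=1e6, **chip)   # 1 MHz
print(fast / slow, energy_per_cycle_joules(**chip))
```

Power falls in step with the clock, while the energy spent per tick is unchanged, which is why slowing the clock alone does not buy efficiency.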

The brain wins on energy per useful computation in ways silicon can’t yet copy. But “make the clock slower” by itself isn’t the trick.

What MatX is doing

At the end, Dwarkesh pokes at MatX’s design philosophy. The hint Reiner drops: splittable systolic arrays. Big arrays when you have a big matrix multiply to do. Small arrays when you need flexibility. Keep the data-locality wins of huge systolic arrays without giving up the bandwidth wins of having many smaller units.

Whether that works in practice — whether the silicon actually behaves as described — is the question MatX is being built to answer.

Key Takeaways

The fundamental primitive of an AI chip is a multiply-accumulate: multiply two low-precision numbers, add to a higher-precision running total.
A p-bit-by-q-bit multiplier costs roughly p × q gates. Multiplier area scales quadratically with bit width. This is the single biggest reason FP4 beats FP8 by more than 2x.
A full adder doesn’t add 32-bit numbers — it counts the number of 1s in three input bits and outputs the count in two bits. It’s the workhorse of binary summation.
A mux (multiplexer) is how a chip “selects” one register from many. It costs n × p AND gates plus (n-1) × p OR gates. Most of a CUDA core’s area is mux, not multiplier.
Systolic arrays harden two loops of a matrix multiply directly into silicon. Weights sit in place; activations stream through. For an x-by-x array, the compute-to-communication ratio goes from 1:1 to x:1.
Nvidia’s B300 generation finally admits FP4 is 3x faster than FP8, not 2x. The true answer is closer to 4x — quadratic scaling, again.
The clock cycle is a global synchronization tick across the entire 100-billion-transistor chip. Setting clock speed is a fight between adding pipeline registers (more area, more speed) and not adding them (less area, less speed).
Feedback loops in logic set the ceiling on clock speed. You can’t pipeline a running sum without changing what it computes.
FPGAs cost 10x more area than ASICs because every gate is implemented as a lookup table (a programmable truth table), not as fixed silicon. A 4-input LUT is 32 gates; the equivalent ASIC gate is 3.
FPGAs win commercially when you’d otherwise pay $30M in tape-out costs for a chip that changes every month.
Caches (in CPUs) make hidden hardware decisions about where data lives. Scratchpads (in TPUs) force software to choose explicitly. Cache = fast and unpredictable. Scratchpad = predictable but harder to program.
One of the biggest non-math areas on a CPU is the branch predictor: silicon dedicated to guessing which way “if” statements will go five cycles before they’re evaluated.
A GPU is approximately “many small TPUs tiled across a die.” A TPU is “a few enormous matrix units.” Same primitives, different granularity.
Energy on a chip is mostly dynamic switching power: capacitors charging and discharging as bits flip. Run the clock 1000x slower and you draw 1000x less power, but you also do 1000x less work, so the energy per operation doesn’t improve. The brain’s efficiency advantage isn’t just about clock speed.

Claude’s Take

This is one of the cleanest expositions of AI chip design I’ve encountered. Reiner builds the entire stack — gates, multipliers, muxes, systolic arrays, clocks, FPGAs, caches, GPU-vs-TPU — using maybe four primitive ideas and one recurring theme: every chip-design decision is some flavor of compute versus communication.

What makes the interview unusually good is the visual scaffolding. Reiner literally draws each circuit. The full-adder-as-bit-counter explanation, the Dadda multiplier, the mux-as-mask-then-collapse, the systolic array as a way to bake two loops of matmul into silicon — these all land because they’re built from the previous layer. You can follow it without a hardware background if you’re willing to sit with the diagrams.

The framing that “a GPU is many tiny TPUs tiled together” is the kind of compression that makes you see the whole industry differently. Tensor cores stop looking like an Nvidia-specific thing and start looking like a fundamental architectural primitive that Google had the courage to scale up further than Nvidia did.

What’s missing: power and thermals barely come up. The role of HBM bandwidth in the data-center system is hinted at but not dissected. There’s no discussion of how memory hierarchies actually inform model architectures (the FlashAttention story, say). And the MatX teaser at the end is deliberately vague — fair, given that’s their competitive moat, but it leaves you wanting.

Score: 9. Loses one point for the slightly compressed treatment of floating-point versus integer arithmetic and for not connecting the chip-level story back to what these decisions mean for which models train best. But as a from-the-gates-up tutorial, it’s hard to beat.