Advanced Skylake Deep Dive - Matt Godbolt
ELI5/TLDR
Your CPU pretends to run instructions one at a time, in order, like a recipe. It does not. Behind the scenes it is running a small bureaucracy: breaking instructions into smaller pieces, renaming variables so independent work can happen simultaneously, guessing which branches you will take before you take them, and maintaining an elaborate ledger so it can pretend none of this happened. Matt Godbolt, the person behind Compiler Explorer (godbolt.org), walks through every stage of this process on Intel’s Skylake chip, mostly by reverse-engineering what Intel refuses to document. The practical upshot: keep your code simple, avoid integer divides, and align your loops.
The Full Story
The Pipeline You Learned in School Is a Lie of Omission
The textbook CPU pipeline has four steps: fetch, decode, execute, write back. One instruction flows through like a car on an assembly line. Skylake’s actual pipeline is more like a factory with a front office that does speculative paperwork, a warehouse of workers who grab tasks in whatever order suits them, and a retirement department that pretends everything happened sequentially.
Godbolt breaks this into three zones. The front end fetches bytes and turns them into internal operations. The back end executes those operations out of order, whenever resources are free. The retirement unit restores the illusion of sequential execution.
The Front End: Turning Byte Soup into Work Orders
x86 instructions are famously irregular. They range from one to fifteen bytes long, a consequence of bolting four decades of features onto a 1970s design.
“It’s the sort of thing that happens when you take a design from the 1970s and incrementally add things to it while maintaining backwards compatibility… there’s bytes that say, ‘Hey, the next byte is now interpreted differently unless it’s Tuesday and the moon is rising.’”
The fetch unit grabs 16 bytes at a time, aligned to 16-byte boundaries. It does not wait to find out where the program is actually going. The branch predictor tells it where to look, and the fetch unit obeys on faith. By the time the CPU discovers there was a branch hidden in those bytes, it has already fetched two more chunks. If the guess was wrong, those chunks are garbage.
Pre-decode: Educated Guessing
The pre-decoder’s job is to figure out where one instruction ends and the next begins inside those 16 raw bytes. The best theory (from Agner Fog, the retired Danish anthropologist who is basically the patron saint of this field) is that it speculatively tries to decode an instruction at every possible byte offset simultaneously, then throws out the overlapping ones. It is not always right about complexity, but it never misidentifies boundaries.
The pre-decoder also spots a common pattern: a compare instruction followed immediately by a conditional jump. Since nearly every loop ends this way, Intel fuses them into a single internal operation. This is called macro-fusion — two full instructions become one internal operation.
The Decoders: Not All Created Equal
There are four decoders. The first one is the overachiever: it can handle instructions that expand into up to four internal operations (micro-ops). The other three can only handle single-micro-op instructions. So the theoretical throughput is four instructions or five micro-ops per cycle, in patterns like 4+1, 3+1+1, 2+1+1+1, or 1+1+1+1.
The worst case: a sequence of instructions that are each exactly two micro-ops. Only the first decoder can handle them, so you get two micro-ops per cycle instead of five. This is, as Godbolt notes, what happens when you divide.
A 32-bit integer divide on Skylake expands into roughly 36 micro-ops; the 64-bit version runs for on the order of 100 cycles. “Oh my god,” was Godbolt’s reaction upon first seeing the trace.
There is also a microcode sequencer, essentially a small ROM with its own interpreter. Anything requiring more than four micro-ops — locked operations, string copies, divides, cpuid — runs a little program stored in this ROM. Someone at Intel has the job of writing these microprograms. Godbolt met someone whose neighbor does this for a living.
Micro-fusion: A Space-Saving Trick
Separate from macro-fusion, there is micro-fusion. Many x86 instructions combine a memory access with an arithmetic operation (like “add the value at this address to this register”). Logically this is two operations: a load and an add. But the micro-op format is made wide enough to carry both, so it counts as one micro-op in all the queues and buffers. It only splits into two when it actually reaches the execution units. The distinction matters for throughput accounting.
The Micro-op Cache: The Happy Place
All that decoding is expensive. The micro-op cache (Intel calls it the DSB, for Decoded Stream Buffer — Intel loves acronyms) stores previously decoded micro-ops so the CPU does not have to re-decode them. Once code has been decoded once, subsequent executions stream directly from this cache, bypassing the entire legacy decode pipeline.
It is, by cache standards, deeply weird. On Skylake: 32 sets, 8 ways, up to 6 micro-ops per cache line, but no more than 3 ways can serve any 32-byte region of code. Godbolt suspects this limit exists to prevent the cache from being monopolized by code with many execution paths through the same region, similar to a problem that plagued the Pentium Pro’s trace cache.
Any branch — even one that is not taken — ends a cache line. This is why compilers align loops to 16-byte boundaries: so the fetch unit and micro-op cache can serve the loop cleanly.
The Loop Stream Detector: Brilliant but Broken
Between the micro-op source and the renamer sits a queue — a circular buffer of a couple hundred entries. The loop stream detector (LSD) notices when a backward jump lands inside this buffer. When it does, the LSD stops the entire front end and just replays the buffered micro-ops. No fetching, no decoding, no cache lookups. It can even unroll short loops up to 8 times to fill its 4-micro-op-per-cycle delivery bandwidth.
There is a catch.
“It’s disabled on Skylake.”
The OCaml community discovered that short loops using the high byte registers (AH, BH — the upper 8 bits of 16-bit registers) caused, in Intel’s words, “unpredictable system behavior.” The Debian bug report called it “nightmare level.” Intel could not fix the silicon, so they shipped a microcode update that simply turned the loop stream detector off entirely. An entire hardware optimization, gone, because of an edge case in register naming that an ML language happened to exercise.
Renaming: Where the Magic Happens
The renamer is, in Godbolt’s telling, the most important stage. It does three things at once, four micro-ops per cycle:
1. Reorder buffer write. Every micro-op gets an entry in the reorder buffer (ROB), a 224-entry ledger that tracks everything in flight. (Granite Rapids bumped this to roughly 500.) This is how the retirement unit later restores program order.
2. Register renaming. The CPU has only 16 architectural registers (the ones you see in assembly code). Behind the scenes, there are hundreds of physical registers. The renamer maps each architectural register to a physical one. Every time an instruction writes to, say, EAX, the renamer assigns a fresh physical register. The old mapping is recorded for later cleanup.
This is what unlocks parallelism. Without renaming, consecutive loop iterations fight over the same registers. With renaming, each iteration gets its own physical registers, and independent iterations run simultaneously. Godbolt demonstrates this with his sum-of-squares example: without renaming, 10 cycles per iteration. With renaming, 1.5 cycles per iteration.
3. Dispatch to execution units. The renamer decides which execution port each micro-op will use. This decision is made early and is slightly myopic — by the time the operation actually executes, a different port might have been free — but the scheduling algorithm (reverse-engineered with “high probability” of accuracy) balances load across ports sensibly.
Zeroing Is Free
The CPU recognizes XOR EAX, EAX (the standard idiom for setting a register to zero) and does not execute it at all. It just points the register at a permanently-zero physical register. No execution unit is used. No port is occupied. It is, genuinely, free.
“Doesn’t issue at all. So, although it’s written to the reorder buffer, it doesn’t go to any execution unit. It doesn’t take up any resources at all. It’s magic.”
Move elimination works similarly: MOV RAX, RBX just makes RAX point to the same physical register as RBX. No work done. But there is a limit of four active aliases before the system has to fall back to actually executing moves.
Alder Lake’s Party Trick: Free Increments
On Alder Lake and later, the renamer can absorb small adds and subtracts (within an 11-bit range, roughly plus or minus 1024) by just annotating the register mapping with an offset. INC RAX becomes “RAX is P88, plus one.” No execution unit needed. The cost is deferred to whatever instruction eventually reads the value — except for shifts, which inexplicably pay an extra cycle for it. The community’s best guess involves barrel shifter setup times, or possibly just a bug.
The Back End: Where Work Actually Happens
After all that front-end bureaucracy, the actual computation is almost anticlimactic. Micro-ops sit in reservation stations (also called schedulers) until their inputs are ready and an appropriate execution unit is free. Then they execute, results are broadcast on a write-back bus, dependent operations wake up, and the cycle continues.
Skylake has execution ports with specialized units: three integer/vector ALUs, two shifters, load units, address generators, and a store data unit that oddly lives on the ALU side rather than with the load/store units.
The Memory Order Buffer: Organized Paranoia
Everything the CPU does is speculative until retirement. Stores cannot go to real memory until the instruction that caused them is proven correct. But loads that come after a store need to see the stored value. The memory order buffer (MOB) handles this contradiction.
The store buffer holds uncommitted stores, split into address and data components. The address is often known long before the data (you might know you are writing to address 20, but the value involves 500 cycles of square roots). This separation lets the CPU check whether a later load conflicts with a pending store without waiting for the store’s data.
When a load arrives, it checks the store buffer first. If it finds a match with no ambiguous earlier stores, it gets the value directly — effectively an L0 cache hit, faster than L1. If the store buffer is clear of relevant entries, the load proceeds to L1 normally. If there are unresolved stores that might conflict, the load waits. Sometimes the CPU predicts that a load will not alias any pending store and lets it go early. When this prediction is wrong, the penalty is severe: a full pipeline clear.
Denormals: The Performance Cliff
When a floating-point unit encounters a denormalized number (too small to represent in the standard normalized format), it cannot handle it in the fast hardware path. It flushes the pipeline and hands the operation to a microcode assist that finishes it in software. This is a JIT deoptimization happening inside the CPU itself.
Retirement and Store Commit
The retirement unit walks through the reorder buffer in order, marking completed instructions as done and freeing their old physical registers back to the free list. It handles four micro-ops per cycle. This is also where exceptions are actually raised — a divide-by-zero that happened speculatively on a path that was never taken just gets quietly discarded.
After retirement, stores are marked “senior” in the memory order buffer and gradually drain to real memory, coordinating with other cores via the cache coherency protocol. This is where false sharing and cache ping-ponging actually bite.
How Anyone Figured This Out
None of this is documented by Intel. A small community of researchers reverse-engineers it using hardware performance counters (which have revealing names even when Intel tries to be cagey), carefully constructed microbenchmarks, and a lot of inference. The key figures: Agner Fog (retired Danish anthropologist, the original reverse-engineer), Travis Downs (who maintains uarch-bench), and Andreas Abel (whose master’s thesis cracked the renamer and port allocation algorithms).
The work requires physical hardware. Virtual machines in the cloud expose only a fraction of the performance counters. And Intel maintains multiple tiers of documentation secrecy — the “pink books” and “yellow books” that large customers can access, which the general public cannot.
“I had some very off-the-record conversations with some contemporaries around the Spectre and Meltdown times. They were like, ‘Well, yeah, the pink thing, it talks about the…’ And I’m like, ‘What are you talking about? Whoa, what?’ And they’re like, ‘Have you not got those?’ I’m like, ‘No.’ And they’re like, ‘Oh, we shouldn’t have said anything.’”
Researchers even discovered that bit 5 of branch addresses was not mixed into the branch predictor’s hash function, effectively splitting the predictor into two independent halves. This was used to build a Spectre mitigation: JIT-compile user code so all its branches have bit 5 set, and ensure all security checks have bit 5 clear. Two isolated prediction domains from one undocumented implementation detail.
Claude’s Take
This is an exceptionally good talk. Godbolt has the rare ability to explain something genuinely complex without dumbing it down or losing the thread. The pacing is right. The examples are well-chosen. The honesty about what is known versus inferred versus guessed is refreshing — he flags uncertainty clearly and consistently.
The content is solid throughout. The CPU pipeline material is well-established computer architecture, and Godbolt’s presentation aligns with the published research from Fog, Downs, Abel, and others. Where he speculates (the broadcast bus theory, the shift penalty explanation), he says so. The claim about register renaming turning 10 cycles/iteration into 1.5 is demonstrable with performance counters and is not controversial.
The OCaml/LSD story is real and well-documented. Intel’s response (disabling the entire feature via microcode) is confirmed in multiple sources. It is a genuinely remarkable incident in processor history.
Two things worth noting. First, this is Skylake-era material. The talk acknowledges this, but the audience should know that Golden Cove and later microarchitectures have substantially different numbers (wider rename, bigger ROB, more ports, fixed LSD). The principles are the same; the specifics have shifted. Second, the “pink books” anecdote is uncorroborated gossip — plausible and widely believed, but not something anyone can verify publicly.
The practical advice (avoid integer divides, align loops, keep code simple, trust the renamer) is sound and non-controversial. The deeper value of the talk is not the advice but the mental model: understanding that your CPU is running an elaborate speculative bureaucracy beneath every instruction, and that the performance characteristics you observe are emergent properties of that bureaucracy rather than inherent properties of the instructions themselves.