
Transcript

Advanced Skylake Deep Dive Matt Godbolt


TITLE: Matt Godbolt: Advanced Skylake Deep Dive
CHANNEL: Jane Street
DATE: 2025-12-23

---TRANSCRIPT---

Hi everyone. Welcome to this talk — glad everyone could make it. I’m here to introduce Matt Godbolt. Matt Godbolt is a C++ developer who loves looking under the hood of compilers, operating systems, and silicon. By day, he writes software for finance, like many of us here. By night, he emulates vintage computers and maintains Compiler Explorer. Well, Compiler Explorer is the official name, but everybody knows it as Godbolt. So, naturally, when someone set up the Jane Street internal instance, they named it godbolt, making Matt the first external speaker who already has an internal domain name named after him. Please join me in welcoming Matt Godbolt. [Applause]

Thank you, Jasper. Well, thank you for having me. This is fantastic. I’m looking forward to talking to you about the kind of things that I love digging into. So, this is “Microarchitecture: Watch What Happens Beneath”, which is not my best title. So, yeah, as Jasper says, Compiler Explorer is why you’ve probably heard of me before. I love making emulators of old computers. That’s what I’ve been doing for the last year while I’ve been on a non-compete. My non-compete ended today, which is why I’m able to be here talking to you. But next week I’ll be working at HRT. So, we’ll be neighbors. I’ll be up the street. Friendly competition.

I also love trying to reverse engineer what the heck is going on inside your CPU. Right, Intel famously doesn’t give a lot of information about what’s going on inside the computer, and that’s for obvious reasons to do with, you know, intellectual property, whatever. But there’s a small but dedicated community of folks who are trying to work out what actually is going on inside. This is my contribution to it, where I was trying to work out how the branch target buffer worked inside some of the older chips. We’re going to be talking about some old chips today, unfortunately, because I don’t have access to the very, very new stuff — I’ve been on a non-compete for a year, and it’s very expensive to rent them. But anyway, I’ve done some research before, and my claim to fame on this is that I was cited in the Meltdown and Spectre paper, which was a nice moment of recognition for this kind of stuff.

You’ve probably seen a CPU pipeline before in textbooks. And I know I’ve been speaking to some folks who are now sitting very threateningly in the front row, and they have a very deep understanding of everything we’re going to cover here. But for the rest of you, hopefully you’ll forgive me doing a quick introductory reminder of what a CPU pipeline looks like. So, when I grew up, this is what a CPU pipeline looked like. You fetched an instruction, you decoded it, you executed it, and you wrote it back, in sequence. And then, you know, you could do this as a sort of production line, as a pipeline, one step at a time. So, one instruction was being fetched while the previous one was being decoded, and the one before that was already being executed, and the one before that was being written back. And therefore, although it takes four cycles to get through here, we’re still doing one useful piece of work every clock cycle. That was a big win.

Modern CPUs, though, realized that we can do better than just this single step at a time, and we can break the system into three broad categories. We’ve got a front end where we’ve got a branch prediction unit.
I wanted to talk to you a lot about branch prediction. I actually did a whole bunch of reverse engineering work for it, but I don’t have time. So, we’ll have to talk about it afterwards or in the pub or something like that. But, anyway, there’s a fetcher decoder as before. We have a rename step, which is sort of the gateway into a new world. And then, after that, this middle section here, this is the back end. Things can happen out of order, which sounds counterintuitive, but in programs very often we write code which has lots of little parts that aren’t totally dependent on each other. You know, you read a value, you add 10 to it, you store it somewhere else. You read another value, you add 20 to it, you store it somewhere else. It’s like, well, those two things could happen at the same time. And so, if you can pick up enough instructions in one go, you can actually do more than one piece of work per clock cycle. And that’s what this sort of stack is here in the execute. And that’s what the out-of-order system is doing. It’s doing some clever dependency tracking, and then everything happens in whenever execution units are available and results inputs are ready. And then, finally, we come to the retirement stage, and that’s where everything gets put back in the order that the program was written in so that you don’t notice the monkey business that’s been going on in the middle bit. Okay. But, we’re going to be talking specifically about Skylake, which is old, 2019, 2020-ish era is when it was out. But, again, my laptop is a Skylake. My desktop machine is a Comet Lake or Skylake X, actually. Yeah, the laptop’s Comet Lake. So, these are representative of the kinds of things that actually happen in a modern CPU, but they have changed into kind of server quality CPUs that you’ll be using in your day-to-day job. Where I remember them, I will try and call out the differences. But, again, I couldn’t test the things that were on more modern machines. So, out-of-order on Skylake is a little bit more complicated than that nice textbook diagram. We have a front end that looks like this. Again, the branch prediction unit we’re not going to be talking about. We have sort of two parallel lines, the fetch predecode instruction queue, and then many decoders. Beneath that, we’ve got a micro-op cache. To the right of that, we’ve got a queue of micro-operations, and amusingly named LSD. Then, we hit the renamer. We’re going to talk about all of these things in a second. I just wanted to give you a big picture of like where we’re going with all of this. There What was I going to say about this? There was something I meant to say. It’ll come to me. We’ll cover it anyway. So, anyway, this is the front end. At this point here, by the time Oh, that’s right. Micro-ops. I’ve written micro-op here. So, instructions are specifically on Intel are broken into smaller operations, micro-operations. So, unlike most RISC processors where you’re just like adding two numbers together or you’re reading or you’re doing a branch, some of the Intel instructions are crazy, and you can like do some crazy address calculation, which is, you know, one number plus another one times four plus some offset, and that’s the address that you’re going to read from. Then, you read from that address. Then, you’re going to add to it, and then you’re going to store back to it. And that’s one instruction. 
So, it makes sense for the poor front end here, which deals in terms of those crazy x86 instructions, to break it down into a sensible RISC architecture in the back end. But, we never get to see that. We just have to infer its presence from all the measurements we do on the outside. So, that’s what a micro-op is. Yeah, so by the time we get to this point here, we have a stream of micro-operations that are in program order. They get to the renamer, and then they reach the back end, which is a little bit smaller and simpler, and we won’t be covering as much of this. If I’ve got my timings right, we’ve probably only got 5 minutes of this at the end. But, we’ve got a scheduler. We’ve got multiple execution units, not the four that are here. There are actually many more than that. And then, there’s a write-back stage. Then, we have the retirement, and then post-retirement, we have the store buffer commit phase. And again, each of these steps is probably a simplification, but this is mostly built out of measurements that we can make. And by we, I don’t mean me, actually. Although I have contributed to this in the past, that the community of people that reverse engineer all this kind of stuff, I’m relying very much on their results that I’m presenting here. These things are not necessarily found in the Intel manuals. You have to go and dig them out yourself. So, I have some references at the end. If you’re interested in this and how you can measure it yourself, knock yourself out. It’s a really, really interesting rabbit hole to go down. This selection of TLAs, this soup of letters over here, are all the sort of shared data structures that each of the bits of the system can see. We’ll talk about those separately, but I wanted to call them out to begin with. We’ve got a physical register storage. So, while you tend to write programs or C programs in assembly that refer to, you know, EAX and ECX and EDX and whatever, the 16 registers that we have access to. 16, what a luxurious number. I know we were talking about earlier that we probably could do with some more, but like it wasn’t that long ago we only had eight of them, of which two of them were used for other purposes. But, that is small fry compared to the number of register slots that are actually available on the CPU. And so, that behind the scenes, it’s doing all sorts of allocation and renaming and mapping to take advantage of the extra space it has on the chip. So, the physical register file contains all of the actual values of the registers, and there are probably hundreds of those. The RAT is the register alias table, and that’s the structure that maps the current EAX to where on earth it is currently on the chip. And other things. There’s a reorder buffer, which stores essentially a ledger of all of the micro-operations that have flowed through the system. And is essentially doing sort of the reference keeping of like this is the point at which we know an instruction has been issued, and later on we will retire it in the order that they came through into this reorder buffer. Then, there’s a reservation station, which is also called the scheduler in some literature. That’s where we actually store the things that have yet to be processed. Like the actual operations themselves are like, I need to do an add of this number and that other number. Or this number and a number that we’ll get when we read that it depends upon has completed. They live in the reservation station. 
Branch order buffer is a sort of checkpointing system for branch misprediction. Again, I wanted to talk more about this, but sadly, we don’t have time. But it’s fascinating, if you sit down and think about what the thing must be doing when it’s predicting, you know, 20 branches into the future, and then it gets one of them wrong, and it doesn’t want to throw all of the work away. And then there’s a memory order buffer, which is responsible for making sure that loads and stores to memory still make sense even though we’ve reordered everything.

So, we’re going to start with the front end. This is going to be the majority of the talk. I’m going to use this example, and, well, it’s sort of thematic that I have to do it in C++. Although we did attempt to write this in OCaml earlier — it did not generate the same code. I will leave that to your imagination. So, it’s a bit of a made-up thing. It’s taking an array of integers, squaring them, and summing them up. So, effectively it’s just doing a rolling sum of the squares of the integers. And it suits my purpose because it compiles to six instructions and exactly 16 bytes of machine code (a sketch of roughly this code appears below). So, I’m sure you’re reasonably familiar with this, but on the right-hand side we have this text representation, which is the assembly code, and on the left-hand side is the bytes that the CPU actually reads — that’s the machine code. It used to catch me out all the time; I used to use the two interchangeably, which is foolish. And you know, you don’t really need to know much about it. It’s like: we read through the array pointer, we square it, we add it to our running total, we move the array pointer to the next element, we compare it to the end, and then we jump back if it isn’t at the end. Yeah. Pretty straightforward stuff, 16 bytes’ worth.

What you’ll notice immediately, though, is that each of these is a different length. We’ve got a two-byte instruction, a three-byte one, there’s even a four-byte one here. In general, x86 instructions can be anything from one byte to 15 bytes long. And it’s not sensibly encoded, because it’s the sort of thing that happens when you take a design from the 1970s and incrementally add things to it while maintaining backwards compatibility, which means that everything harks back to these ancient ways, and there are bytes that say, “Hey, the next byte is now interpreted differently unless it’s Tuesday and the moon is rising.” So, it’s really quite complicated to disassemble x86. Anyone who uses ARM or MIPS or RISC-V or whatever — it’s lovely. All the instructions are the same length, so it’s very simple. But that’s kind of why we need to do this complicated decode.

Another thing I’m going to quickly show you is an instruction trace of a sort. This comes from a somewhat heavily modified tool called uiCA, the microarchitectural code analyzer that Andreas Abel, who’s now at Google, has written and open-sourced. And this lets us see an individual instruction’s journey through the pipeline. So, what we’re seeing is one instruction and then all the stages it goes through, from left to right over time. So, this instruction takes 16 cycles. On the first cycle here, it’s being pre-decoded. It’s hitting the instruction queue here. It’s being issued, dispatched, executed, and then retired. In this particular case I just wanted to show an example.
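For reference, here is a minimal sketch of the kind of routine being described. The function name, signature, and exact code generation are assumptions on my part — the slide itself isn’t reproduced in the transcript — but the shape matches the six instructions discussed:

```cpp
// A guess at the running example: square each int and keep a rolling sum.
// Compiled at -O2 with vectorization disabled, the loop body comes out as
// roughly: mov eax,[rdi] / imul eax,eax / add edx,eax / add rdi,4 /
// cmp rdi,rsi / jne .loop  -- i.e. the six instructions described above.
int sum_of_squares(const int* begin, const int* end) {
    int total = 0;
    for (const int* p = begin; p != end; ++p)
        total += *p * *p;
    return total;
}
```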
All right — first stage: fetching. This is probably the easiest stage, although I’m going to gloss over a whole bunch of really complicated stuff that it does. So, we have the predicted instruction pointer. The fetch unit doesn’t wait for the program to definitely go somewhere; it’s being told where to go by the branch predictor at all times. And that’s kind of a really interesting aspect that I had never even considered, because you might pick up a chunk of 16 bytes, and until you’ve decoded it, which is like four or five steps down, you don’t even know if there’s a branch in there. And by that time, you’ve already pulled up two more 16-byte chunks of memory, and you’re like, “Well, I went the wrong way, right?” So, the branch predictor’s got a big job. And it does, by and large, a good job.

So, the fetch unit picks up 16 bytes at a time, at a 16-byte boundary, which means that if you branch into the middle of a 16-byte block, then you’re immediately wasting a bit of bandwidth there, because you’re missing out on some of the bytes that you could have picked up in one cycle. The other thing to note here is that I’m not going to talk about caches at all, because that’s another two-hour talk. But this obviously has to liaise with the TLB to work out where the heck these instructions actually come from, and with the L1, L2, L3 caches to actually fetch the bytes into a 64-byte cache line and then pull 16 bytes of it down. So, there’s a lot of work hiding here. But the result of this fetch stage is some kind of pipeline of 16-byte chunks, presumably tagged with some kind of address so we know where they came from and it can be checked later on. All right, easy one done.

Because the x86 instruction set is so complicated, and because what we want to do is unlock this parallelism where we can run more than one instruction at a time, if the most that we could do in a single clock cycle was decode one instruction, then we’d be missing a trick down the line. And so, what we want to try and do is decode as many instructions as we possibly can out of this block of 16 bytes in one clock cycle. So, it’s broken down into stages. The first stage here is called a pre-decode stage. And the pre-decoders have a set of kind of flaky heuristics about what these bytes might mean. They’re flaky because of the only way we can think it could possibly be decoding this — and by “we” I mean, in this instance, Agner Fog — given that we have no idea where the instructions lie in here. All we know is they don’t overlap with each other, and each one depends, of course, on having decoded the previous instruction. So, how could you possibly do it in one cycle? The only thing we can think of is that it just speculatively tries to decode an instruction at every possible offset, and then it has a sort of filter step where it says, “Well, these ones overlap with one that came before, so that can’t be a valid decode.” And again, it’s a bit of a heuristic hack. It never gets it wrong, per se, but it sometimes conservatively tags instructions for later stages of the pipeline as being more complicated than they actually are.

So, these 16 bytes happen to represent exactly my routine that I showed. The pre-decoder tags it. It essentially just breaks it down and says, “Well, this is the opcode byte.
These are the opcode bytes, whatever.” And the interesting thing here is that we’ve been able to discover where the bytes of the first instruction lie — that’s the mov — then the second one, which is the imul, the add edx, the add rdi here, and then for this last one, a compare and a jump. And at this stage in the pipeline, it spots an opportunity. Because although we treat them as two separate instructions — we have to; well, that’s what the ISA is — it’s so common: you know, every loop ends in a compare and a jump-if-not-equal or something like that. So Intel decided to say, “Well, what if we had an instruction which was just a compare-and-jump as one unit?” Well, we could make a new instruction, we could publish it, and then get all the compiler writers to, you know, actually emit this instruction — or we could just spot it in the instruction stream and replace it here. And that’s what they do. This is called macro-instruction fusion, and it’s tagged, at least, in this pre-decode step. So, we still don’t know really what these bytes are. And again, sometimes the bit masks or whatever it does are wrong.

The limitations of the pre-decoder: it can do up to five instructions in a cycle, and it does this macro-op fusion. Now, at the moment — and by “the moment” I mean for Skylake; it may have changed, but I don’t think this one has — a compare or test on anything that isn’t memory (memory being more complicated), followed by any branch, is treated as a single operation. An add, subtract, increment, decrement, or and, followed by a branch from a more limited set, is again treated as a single compare-and-branch unit. It’s worth saying that you probably don’t do this — although you write compilers, so you might have the latitude to — but don’t emit any length-changing prefixes. These are usually used in the crazy 32-bit modes to use smaller pointers, but they effectively snarl up the pre-decoder. It just goes, “Whoops, I don’t know,” because you can make an arbitrarily complicated instruction out of just putting random-ass prefixes on it. It just throws its hands up, and it takes, I think, a three-cycle penalty if it sees one. It’s like, “Whoop, I’m out.” And yes, the other thing that it does is notice whether an instruction is simple or complex, and it steers that instruction to the appropriate decoder.

So, we’ve got these partially decoded instructions — which, again, are just blocks of bytes plus, presumably, some kind of sequence information — and they’re passed on to the decoders. There are four decoder units, and the instructions are passed to them in order. The result is the micro-operations. In this particular example, every instruction is one micro-operation. So, each one on that side is one thing on this side here. So, one of them is like a 32-bit load through rdi, a multiply, an add, an add, whatever. And then this one is a macro-fused one, in which I’ve made up an instruction of compare-and-jump.
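To see where that fusible pattern comes from in practice, here is a hedged sketch: a plain pointer-walking loop whose bottom usually compiles to the adjacent compare-and-branch pair described above. The exact output depends on compiler and flags — the easy check is to paste it into Compiler Explorer and look at the loop’s last two instructions:

```cpp
// A loop whose bottom typically compiles to an adjacent cmp/jne (with the
// preceding add also a fusion candidate) that the pre-decoder can macro-fuse
// into a single operation. The codegen in the comment is an assumption about
// typical -O2 output, not something taken from the talk:
//
//   .loop: add  rax, <...>      ; loop body
//          add  rdi, 4          ; advance pointer
//          cmp  rdi, rsi        ; flag-setting op, no memory operand
//          jne  .loop           ; immediately-following branch -> can fuse
long sum(const int* p, const int* end) {
    long total = 0;
    for (; p != end; ++p)
        total += *p;
    return total;
}
```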
There are two kinds of fusion. One is the macro-fusion we’ve already talked about. Macro-fusion — macro is big; this is how I remember them — means that two big things, the instructions, have been fused into one internal operation. So, we’ve talked about that. Micro-fusion, on the other hand — well, in this stupid ISA that Intel have come up with, it’s so common to have an address operand on the right-hand side. This is logically two operations — that is, an add rax, [r14] is “go calculate whatever the r14 address is, go read that from the cache, come back, and then go to the ALU and add it to rax”, right? That’s two distinct operations. Practically, though: what if we just made our micro-op format a little bit wider and had an optional memory tag? Then, for most instructions where we’ve got the space, we can say, “Well, this is both an add and a read.” And for the simpler cases, that can happen. So, a micro-fused instruction will later on actually be executed by two different parts of the chip as two micro-operations, but at this point we can pack it into one micro-op in all the queues and things that are coming up. So, we’ve got micro-fusion and macro-fusion.

There is also a microcode sequencer. The microcode sequencer is effectively a ROM with a sort of simple interpreter put on top of it. Anything that’s more complicated than four micro-operations needs to come from the ROM rather than just being hardwired in the decoder. And that ROM can either just literally puke out a bunch of micro-ops — here’s 15 micro-ops for a complicated instruction — or it can actually have logic in it itself. It can go, “Oh, okay. Well, while ECX is not equal to zero, keep issuing these micro-ops, right?” And so, for things that are complicated, like locked operations, which I’m sure anyone who’s done threading has seen — that takes more operations than four. Or obviously things like rep stos and friends, the repeating string operations: that’s effectively just a microcode sequence that keeps emitting the read and write operations until you hit the end of the thing that you’re copying. And obviously complicated things like rdmsr or cpuid — they effectively run a little program. You know, it’s going to go off and say, “Hey, what kind of CPU are you? Well, I can return that.” So, that’s kind of cool. I actually met somebody who worked at Intel. He said, “Oh, yeah, my neighbor writes microcode.” I’m like, “What a crazy job that must be.” And then the other thing that’s done by the microcode sequencer is things like divides. I know they’re also complicated — you know, more than four or five micro-operations — but it was kind of an eye-opener to me. I was like, “Oh, that’s how divides work.” You know, it’s got all this acceleration circuitry, but that’s how they can short-circuit. It’s like, “Well, okay, we’ve reached the point where there are no more bits coming out. We can stop here.”

So, we’ve got four decoders, but not all decoders are created equal. The first decoder is the real decoder. It can do up to four micro-ops. So, any instruction that is four micro-ops or fewer can be handled by the first decoder. Anything that’s tagged as complex, even if it actually turns out to be a simple operation, gets steered to the first decoder, and the other ones won’t touch it. Any instruction that requires a switch to the microcode sequencer has to go through the first decoder as well. So, for things like that divide, the first step will be, “Hey, here’s the setup code,” and now we’re going to start reading from the microcode sequencer. Decoders two, three, and four can only do a single micro-operation — so, only simple things. But this is counting the fused micro-op: those adds with memory still count as one in this world, even though later on they’re going to be broken into two.
So, you can decode four instructions or five micro-ops, which means that if you really need to get lots of micro-operations out, you can have a four-micro-op instruction followed by a one-micro-op instruction, and that can come out in one clock cycle — and then the other two decoders just have to sit and twiddle their thumbs. So the patterns are 4-1, 3-1-1, 2-1-1-1, or 1-1-1-1. And luckily, most things fit into this world, unless you’re doing something silly like dividing again. The worst case you can possibly hit is a sequence of instructions which are exactly two micro-ops each, because they can only go to the first decoder and all the other ones are idle, and then the next clock cycle you generate another two. So, that’s the worst case. But although I’ve just banged on about this and it sounds interesting and complicated — Intel refer to this as the legacy decode pipeline — we’re mostly saved by the next thing we’re going to talk about, which is the micro-op cache and the loop stream detector.

Just to show you some pretty rainbow diagrams: these are some other, more complicated instructions that come out as being two micro-ops, three, and four. A push rax is like, “Well, I have to write RAX to the stack pointer and then I have to update the stack pointer.” Makes sense? An exchange of RBX and RAX is probably temp-equals-RBX, the whole swap thing. And then an add to a memory location here is one of those really complicated memory operations where we have to read the memory, we have to then add to it, and then we have to write it back — and writing back is always two ops, because the address generation is separate from the data part of the store. And then, yeah, because I have to put a divide on here: this is a 32-bit divide on Skylake. They’ve got a lot better — I mean, I’m sure Granite Rapids is a lot quicker than this. Yeah, it’s like 100 cycles for the 64-bit version. But the interesting point here is that we can see that in the first clock cycle, decoder zero was able to output these three micro-ops. They kind of flow through. And then it switches to the microcode sequencer, and there’s a two-cycle delay here. And then it kind of carries on doing whatever magical stuff is going on here.

So, yes, we’re saved by the micro-op cache. The micro-op cache sits logically to the side of this whole complicated pipeline we’ve just been talking about. And we’re in one of two modes: either we’re caching or we’re reading from the cache. So, unlike the regular L1, L2, L3 caches, it’s not like every read goes and says, “Are you there? Okay, no, I’d better go and do it the hard way.” You’re just in one mode or the other mode, and it’s jumps that determine whether or not that process starts. So, in caching mode, as we’re running through our program, the micro-ops that the translation engine has been producing — sorry, MITE, the micro-instruction translation engine; again, Intel loves coming up with complicated names for things — get pushed into this cache and are just sort of written in. Or, if a jump happens and the branch prediction (or whatever part) notices that the destination of that jump happens to be in the micro-op cache, then we start streaming instead from the micro-op cache. And this is the happy place. You want to be in this situation.

The micro-operation cache is the weirdest cache you’ve ever seen, because it has a really difficult job. If you think about a normal cache, like an L1, L2, whatever, you’ve got a one-to-one mapping.
It’s like, “This byte is in that cache line.” And then you just fetch the whole cache line around it and you whack it in the cache and you’re done. But if we jump to an address that’s not on a cache line boundary, then we will decode that instruction, but we can’t reverse back and say, “Well, what was the instruction before it, then?” You just have to kind of start at that point. And so, there are a lot of really weird restrictions on it. This has changed a lot since Skylake, but Skylake has 32 sets of eight ways. And each way — each cache line — holds up to six micro-operations. Some of the micro-ops take two slots; so, if you’ve got one of the movs with a 64-bit value, you need two slots in order to store the 64-bit value. It is inclusive with the L1I, which is a clue to how it’s implemented. And then the weird thing here is that no more than three ways can be used by any 32-byte chunk of code. And this to me smells like they had hit a real snag, because morally this is very similar to the trace cache that the Pentium 4 used. And the problem with that was that you have as many entries in the cache as there are traces through your code, and so very quickly it blows up as the traces change. So, I think they try to limit that by saying: if any block of memory — a small area of the program — needs more than three lines, then we’re probably in the case where it’s about to start monopolizing the whole cache, and we’d rather throw it away and prevent it from being cached at all. It’s worth saying that any branch, even a not-taken branch, ends a cache line, and then it starts caching from the next thing. So, it kind of tries to find the sensible boundaries.

But one of the take-homes from this is: if you’re generating code, and you have more than one entry point into the same 32-byte block, then you might not be using the DSB — the decoded stream buffer, Intel’s name for the micro-op cache — as effectively as you’d like. And you know, that’s why we typically align loops on 16-byte boundaries, so that you get the benefit of the fetch system picking up the loop. In newer machines, it’s more like you can have six ways per 64 bytes. So, it’s a bit of a weirder thing, more closely tied to the L1 cache. And then you can deliver anywhere between four and six micro-ops per cycle, which means that whether they’re complicated or simple instructions, you can get a pretty consistent stream out of it. The literature says six. I could only ever measure four.
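To put rough numbers on the structure just described (my arithmetic, not the talk’s): 32 sets × 8 ways × 6 micro-op slots comes to at most 1,536 cached micro-ops, and the three-ways-per-32-byte rule caps any single 32-byte chunk of code at 18 micro-ops in the cache. That’s a big part of why small, aligned hot loops pay off, and why compilers expose knobs like GCC’s -falign-loops=16 to put loop tops on those boundaries.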
The loop stream detector. So, in between this source of micro-operations — coming from either the legacy system or the cache — there is a queue. Obviously, you know, we love queues. Queues buffer: if the renamer or the execution units are backing up a little bit, we’ve got a bit of space in here. But this queue is effectively a big circular buffer of a couple of hundred entries that we write our micro-ops into, in program order. And what the loop stream detector does underneath is it says, “Wait a second. You’ve jumped to a location that’s actually already in this buffer.” You know, it may have already popped off the end of it, but we haven’t overwritten it yet. Wait a second — wouldn’t it be better if we just held onto that part and kept streaming back that same sequence over and over and over again? And so, that’s what the loop stream detector does. And that means it can power down the whole front end. There’s no caching that has to be done, there’s no decoding — it’s just brilliant. But it’s relatively small. And it has limitations.

So, it spots loops that fit in the queue entirely. It can deliver up to four micro-operations each cycle. And it can’t deliver more than one loop iteration per cycle. So, if you have a really small loop, or a loop that isn’t a multiple of four micro-ops — say it was five — you get four in the first cycle and one in the next, and then four and then one and four and one. Except it can unroll: it actually spots the longest thing it can fit, up to eight copies on Skylake. So, it’ll effectively unroll that loop in hardware, and then in every cycle it can stamp out four, which is pretty cool. Which is great. But why was it in parentheses, and why did it have a star? Well, it’s disabled on Skylake, due to a massive bug — described as a “nightmare-level” bug in the Debian fix notes. Short loops using AH, BH — you know, the high part of the 16-bit register — and the corresponding wider register may result in unpredictable system behavior. That is scary stuff. Something about the OCaml runtime or the OCaml compiler liked to do this: in a loop it would use the high eight bits and the low eight bits. Maybe tagging. It was found by some folks in the OCaml community and reported, and then Intel determined it couldn’t be fixed, and they just issued a microcode patch which turned the whole thing off.

All right: renaming. So, now we get to the cool part. The renamer has probably the biggest job in the entirety of the front end, because it has about three different jobs. The first thing that happens is that the micro-ops have now reached the point where we can say, “This is definitely going to happen.” I mean, definitely — it’s definitely speculatively going to happen. It might be undone later on when we discover we shouldn’t have done it this way, but we’re going to write it into the reorder buffer, which just says this instruction happened, or will happen. We also rename to physical registers. So, we take the EAXs and the RDIs and whatever, and we map them onto this vast array of on-chip resources to actually use as registers. And then it also takes the micro-operations and ships them off to the right execution units’ schedulers to sit and wait. They’ll sit in those schedulers until the execution unit is ready to accept a new operation — maybe there’s a multiplier that’s backed up and there are other instructions queued ahead of it — or maybe this instruction depends on the result of a previous instruction that hasn’t completed yet, and so it’ll sit there and wait until all of its dependencies come in. But this is the point at which we fan out into all the various data structures, and then the out-of-order system picks it up on the other side. Amazingly, although it’s doing all this complicated work, it can do four micro-ops a cycle. This is staggering. Maybe this is the kind of thing that in hardware is a lot easier than it sounds.

So, the first thing we do is the reorder buffer write. This is just to keep our ledger for in-order completion later on. There is a process called unlamination that happens at this point. Sometimes a micro-op that came through — even though it was packed as a single micro-op — is determined to be more complicated than can be handled as a single unit, and so it gets broken up at this point into multiple pieces. Some of the more complicated addressing modes fall into this category. There are 224 entries in the reorder buffer.
That is the sort of top end of how many things that can be possibly in flight at once that are in the system being executed. That’s more like 500 on Granite Rapids and above. So, that’s what a place that they have changed a lot and expanded. The issuing step is where we actually take the live to be calculated micro-op and give it off to these execution units. There are the execution units in two big blocks that have many ports on them. It’s a weird terminology. But there’s an ALU unit that has all of the things to do with arithmetic and logic as you might imagine. And then there’s a sort of memory system which only does loads and address generation. And they have different numbers of entries. The issuing determines also which port within each of these. So, these are further subdivided. But at this point, this is when it’s decided where it’s going to ultimately go. So, even if it’s an add and there are like three different locations on the chip where that add could happen, we decide now at rename time which of the adders it’s going to go to. Which is a little bit myopic because by the time it actually gets ready to run, maybe that’s not the best choice of adder now. Maybe there’s some free adders somewhere else. But it’s useful to know. And obviously, it’s a simplifying assumption here. The algorithm has been reverse-engineered with high probability that they got it right. And it’s kind of simple as you might imagine. It’s to do with the balance of how many operations have gone recently. And it tends to pick the higher port numbers if they’re more free if there’s a tie. And then of the four instructions issued, one and three go to the best place to schedule them and two and four go to the second place. So, it kind of tries to balance them as well. And at this point also now separate from unlamination which actually makes two entries in the reorder buffer for an instruction that we notice later on is a bit more complicated. Anything that’s actually micro-fused, that is like the add with the RDI, gets separated into the load component that’s going to go over here into the load unit and the add part that’s going to depend on the load that’s going to go into the ALU component. So, the difference between unlamination micro-ops and micro-fusion is that this only has one sort of slot in the reorder buffer, whereas the unlamination stuff actually takes up more resources in there. Not that that’s a huge problem with 224 of them. So, to again, I think really the reason for me putting this up here is to sort of show you what information is being stored in each part. We have a bunch of micro-operations. These are no longer in any particular order. And we have maybe three sources that can happen and we’ve got a destination that it’s going to write to and it’s an add. Now, I’ve been unable to find anything that tells me whether or not the values of known quantities are stored in the scheduler in this sort of micro-operation or it’s just a reference. Now, the way that I’ve chosen to show this is that this one has two resolved sources. So, these have already completed the instructions that generated these two numbers. And then for these guys here, I’ve shown that like I’m waiting for the previous instruction that’s this guy in fact. He’s going to write to P24. I’m going to wait for him. And then when he’s completed, he’ll tell me what the results are. We don’t know if that actually happens as in this broadcast action causes the issues things to get written. 
There’s some circumstantial evidence that maybe it is, because there’s a broadcast bus. But it also opens more questions about what happens if you are issuing instructions that have two known quantities, and then we have to read an arbitrary number of physical registers in one clock cycle. And no one has been able to find a limit to how many you can read — unlike Sandy Bridge, which had a very, very limited number of read ports. So, something magic is happening; we don’t know what.

Let’s talk about renaming. What renaming is going to do is unlock more out-of-order opportunities down in the rest of the pipeline. And what we’re going to do is break any dependency that we can, because many times when you say EAX, you just mean the current value of EAX: we’re going to do something to it, then we’re going to throw it away, then we’re going to load something new into EAX, and we’re going to keep doing this process over and over again — and those things are separable from each other. One example of something where the values of the registers don’t depend on an earlier value of the register is a loop, because, you know, we’re going to read into EAX every time around the loop, we’re going to do some work with it, and then we’re going to store it into a value or whatever, and then we’re going to start the loop again. So, if we can break that dependency, it means that we can run maybe two iterations of the loop at the same time, or even more.

The way that this works is we have a table that tells us where each register currently is — which of the physical registers currently contains the most contemporary value of RAX. So, RAX is P39, and so on. And then there is a free list of registers. These are registers that are currently unused — chip real estate going spare. Every time we read from a register, we use its current mapping in the RAT. Every time we write to a register and completely replace it, it might as well be a new register, so we pick a new one off the free list, use that instead, and then update the table to say this is where the RAX register is now. So, in this first instruction we’re reading the RDI register, we’re going to go and fetch memory through it, and then we’re writing to EAX. It really doesn’t matter what was in EAX before — so, P39, don’t care about you. What we’re going to do is rename this to be P45, and that’s me popping off the front of the free list. This is in hardware, remember. And then P22. And then this multiply here is EAX with itself, right? I’m squaring EAX and putting the result in EAX. So, when we read the EAXs, we’re going to use P45, and then when we write, we’re going to use a new one, which will be P46 — and this is indeed what happens. And so on and so forth; we can go through the rest of it here. And now we’ve got this whole rewritten set of micro-operations, phrased purely in terms of physical registers. And we’ve sort of encoded the dependencies between instruction values: if you follow the P45 in here, it’s used here and then it’s done with. So, you know, that has helped us.
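To make that walkthrough concrete, here is a toy sketch of the rename step in C++. It is purely illustrative — the names, the micro-op format, and the sizes are my own assumptions, and the real hardware does all of this with dedicated content-addressed structures at four micro-ops per cycle, not with hash maps:

```cpp
#include <deque>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of the rename step described above: a register alias table (RAT)
// mapping architectural names to physical registers, plus a free list.
struct MicroOp {
    std::string opcode;        // e.g. "load32", "imul", "add"
    std::vector<int> sources;  // physical registers read
    int dest = -1;             // physical register written (-1 = none)
};

class ToyRenamer {
public:
    ToyRenamer(const std::vector<std::string>& arch_regs, int num_physical) {
        int p = 0;
        for (const auto& r : arch_regs) rat_[r] = p++;  // initial mappings
        for (; p < num_physical; ++p) free_list_.push_back(p);
    }

    MicroOp rename(const std::string& opcode,
                   const std::vector<std::string>& srcs,
                   const std::string& dst) {
        MicroOp uop;
        uop.opcode = opcode;
        // Reads use whatever the RAT currently points at.
        for (const auto& r : srcs) uop.sources.push_back(rat_.at(r));
        // A write that completely replaces a register takes a fresh physical
        // register off the free list and updates the RAT. (The *old* mapping
        // is what eventually gets freed, at retirement -- not modelled here.)
        if (!dst.empty()) {
            uop.dest = free_list_.front();
            free_list_.pop_front();
            rat_[dst] = uop.dest;
        }
        return uop;
    }

private:
    std::unordered_map<std::string, int> rat_;
    std::deque<int> free_list_;
};

// Roughly mirroring the walkthrough above:
//   ToyRenamer rn({"rax", "rdi", "rsi", "rdx"}, 180);
//   auto u1 = rn.rename("load32", {"rdi"}, "rax");        // eax <- [rdi]: fresh reg for eax
//   auto u2 = rn.rename("imul",   {"rax", "rax"}, "rax"); // reads u1's reg, writes another
```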
So, this is what that loop — or rather, a loop iteration and a half — looks like. If we didn’t have renaming, this is what it would look like. The first iteration of the loop is from here to about here, and you can see that the next iteration of the loop can’t even start, because, well, I can’t read through RDI until I’ve finished adding to RDI from the previous iteration, right? It’s like I have to wait for it. And it couldn’t start until we’d finished reading from the memory address in RDI, because presumably I still need the RDI register. If we show you the zoomed-out view, it’s 10 cycles an iteration. Doesn’t seem totally unreasonable. Obviously you’d use SIMD if you were actually trying to do this for real.

If we rename it, on the other hand, it looks like this. On the first cycle, we’ve already queued five of our instructions and we’ve issued four of them. The first and the fourth one can already complete on cycle six, because this add can be running concurrently with the load, as they don’t depend on each other anymore. And you can see every time there’s an E, there’s usually a D underneath it, because the E means it’s finished executing and the results become ready and can immediately be used by the instruction that depends upon it. And this is evidence for the broadcasting theory, where it completes and just goes, “Anyone who cares, P45 is now 27,” and, “Oh, cool, I got it,” and then it doesn’t have to wait another cycle. So, if we zoom out, it’s like one and a half cycles an iteration. It really makes a difference. It’s super cool.

All right, let’s talk about my favorite instruction: XOR EAX, EAX. You’ve all seen it, right? Why the hell does it do that? It sets it to zero, but why not just do MOV EAX, 0? Well, the XOR is a smaller instruction. But wait, there’s more. Exclusive-oring a number with itself always gives zero, right? And because of the stupid syntax of x86, this means EAX ^= EAX, right? So, it ends up with zero, fine. Compilers love to do it, but the cool thing is the CPU knows that you love to do this as well, and so it goes: if you’re setting a value to zero, I don’t need to do anything at all. All I’m going to do is arrange to have a physical register that’s always the value zero, and I will just rename to it. Okay, RAX is now P00, and anything that reads from it later on is just going to get the value zero, right? We’re done here. It doesn’t issue at all. So, although it’s written to the reorder buffer, it doesn’t go to any execution unit. It doesn’t take up any resources at all. It’s magic. We can also do some “ones” idioms — there are some parallel idioms that can generate an all-ones value, so presumably there’s some other magical register for that.

We can also do move elimination, because if you think about it, if I move RBX into RAX, all I’ve done now is I’ve got two names for the same register. So, we just have to update the RAT and say, well, RAX is now P88, which is incidentally where RBX is at the moment. There’s no issue — in the pipeline sense: nothing is issued, there’s nothing written, there’s no actual work to do, no ALU operation needed. That doesn’t mean there’s no problem, though, because there is an issue in the other sense: this actually does introduce a complication, which is surprising. There are limits, right? Because now we’re allowing a single physical register to be aliased by more than one architectural register, which means that suddenly we either have to do reference counting to determine when we can free that thing back up again, or we need another trick to determine when everyone’s finished with that particular physical register. What it seems to use is four slots for aliases, and once one of those slots has been used — by having two registers point at the same thing — then until all of those registers have been overwritten, that P88 is now locked.
It can’t be freed. And also that slot can’t be used by anything more. So, you can do four moves in a row and they’re all free and then after that it becomes really expensive to move between registers until you overwrite both the source and the destination that both alias. That has gotten better and I think they now use a bit mask to track things in a much more clever way. Because on Alder Lake and onwards, they have another trick. Because adds and subtracts are so straightforward, what if the renamer goes, well, hang on, an add with one, why don’t I just rename it as RAX is P88 and just remember it’s got a one added to it? How cool. So, again, small increments and adds don’t take up any resources at all. They just get written as sort of a housekeeping issue inside the RAT. We have to pay the price at some point for two reasons. One, the range is only between minus 1024, so it’s got 11 bits of range. And then if it hits that limit, it actually does issue the micro operation that says, oh, fine, add the thing and then we know where we are. And then of course after that it can be renamed from the new value. So, it’s pretty cool. There’s also interestingly, if you think about what this really means, when P88 becomes ready if it were depending on it, anything that depends on P88 has been given the burden of having to now actually do this add. And so, many of the instructions, that’s fine. You just it’s like, okay, I’m reading from P88 and there’s an adder that kind of runs in parallel and then by the time the result gets to me, it’s already had that offset applied. But the shift, like logical shift, for whatever reason, it takes a cycle to do it. Every time if you do MOV RAX, 10 and a shift, you’ll measure the shift takes one cycle. If you do MOV RAX, zero, INC RAX and then you do the shift, the shift now takes two cycles because it pays for doing the add in a single cycle beforehand. And we don’t know why it does that, but the guess that we have is that for things like variable shifts where you’re shifting by the value of another register, because the barrel shifter needs a lot of setup time and it’s a lot of circuitry to do that kind of 64-bit shift arbitrarily, we figured that it probably needs the results a lot earlier in the cycle and so it hasn’t got time to do the add and then set up this thing. And then rather than them just blanket the shifts that have variable shifts, they just said all shifts. Or it’s a mistake. So, this is the summary. This is where we are. We’ve got a predicted program order which gives us a sequence of instruction pointers which are then fetched. They’re decoded into micro-operations. Those micro-operations are renamed and then the reorder buffer is written to in program order and then these sort of micro-operations that are pending execution get written into the scheduler. Right. Okay. Back end looks really simple in comparison, but it does hide the thing we care about, actually doing the work. Everything else is just sort of preamble. So, reservation station. The scheduler and the reservation station are two names that Intel use interchangeably. That’s where we actually store the things that have yet to be processed. And from this nice pipeline to a big soup of things that can be done at some point. The two schedulers, reservation stations that have their entries that are just sat there waiting for their operands to be ready and for an execution unit that they’ve been tagged to go to to become free. But again, the queues are not really any order. 
It’s just whatever’s ready. We’ve got the reorder buffer, and the write-back to the reorder buffer is just to say this instruction is now done — it doesn’t have to hold any other values. Way back, before Sandy Bridge, the reorder buffer actually contained the physical registers; effectively, the reorder buffer was used as the physical register store, and so the results of everything were written back into the ROB. But that’s no longer the case — they live separately, because we added vector operations, and if we had 512-bit-wide slots up there, that’s a lot of waste.

There are various ports attached to these reservation stations. And broadly, the big-picture thing is: every clock cycle, the scheduler says which of these micro-operations is ready to run and issues them to one of these ports, each of which has many actual logic units behind it. And those are pipelined themselves — they have their own pipelines, but they’re fixed length — and when the result comes out, five cycles later, three cycles later, whatever, it comes back onto this write-back bus and it sort of says, “Yes, P47 is now 127.” It’s definitely written to the physical register file. Maybe it’s also snooped on by some of the other entries that are waiting for that result, and they become ready to run. And then the whole process just keeps going.

Looking at the ports that we have — this, again, is Skylake — there are three units that can do integer and vector operations on simple integers. There are two that can do shifts, one that does the permute, some string operations, and whatever. The execution ports over here do all the arithmetic-type stuff — except for the store-data unit, which weirdly lives here rather than with the load/store side. Over on this side, there’s a bunch of load ports that do the loads, and they can also generate the address for a store. And then one really simple unit that can only do simple addresses. So, the address-generation and load/store side generates addresses and does the loads.

We’re going to have to introduce something called the MOB, the memory order buffer. This is probably the most complicated part, and there are going to be, necessarily, some simplifications here. There are three parts to this. There’s a load buffer and a store buffer. And there are some predictors, which are absolutely bonkers. The store buffer holds all of the stores that have not yet gone out to real memory. Now remember, everything we’re doing is speculative at this point. Until the instruction retires, we haven’t got a cast-iron guarantee that something upstream of it — that happened in program order before it — hasn’t gone, “Oh, I just divided by zero,” or “I mispredicted a branch,” or anything like that. So, until it gets to that retirement unit, we can’t let the store go out to real memory. We have to hold it in this sort of holding pattern. But of course, if I store to memory and then the next instruction reads from memory, I might reasonably expect to see the number I just stored. So, I have to do something about this. The store buffer wears many hats. One of them is just to hold the stores until they go out to memory. The other one is to handle the loads that come after it that might need to read from the things that haven’t yet committed. So, it’s broadly broken into these two parts, where we have the address calculation and the data calculation separately from each other.
And this is why the store-data unit lives on the ALU side: where do the values that are going to be stored come from? Probably the ALU. And the address is generated by the address-generation units. What we’re doing here is that, by separating them, some of the time you know what address you’re going to be storing to earlier than you know the data that’s going to be stored there. Like, you know, you square-root something 20 times in a row, so you’ve got this huge long dependency chain of square roots, and then you store the result to address 20. I know that it’s going to store to address 20 pretty early on; I’m going to have to wait 500 cycles before those square roots have all completed. Why do I care about the address? Well, if I’ve got a load coming along and it’s not to address 20, I can do that load. So, I need the address more than I care about the data, for the purposes of letting other instructions that are later in the flow get their results.

Loading is really, really, really complicated, which is one of the other reasons why it’s slow. It’s not just, you know — we talk about L1 cache, it’s three or four cycles to get to L1 cache — it’s going to do all this as well. We first do some prediction about ambiguity: whether this load might alias an earlier store. We have to wait for the address to be calculated; hopefully, that’s quick. And then we have to check the store buffer. Now, if we get a hit in the store buffer, and there are no intervening stores between the hit that we made and anything that hasn’t yet been resolved, then we know that no store earlier than this load can possibly affect the value other than the one we found. And we can go, “Hooray! We don’t even have to go to L1. This is like an L0 cache. Here you go, here’s your number.” If that’s not true — if we miss, as in we don’t find anything in there, but there are also no unresolved stores earlier than us — then we know there is no store in the queue that could affect us and has not yet been flushed out to memory, so we can just issue the load now, and it will finally go to the L1 and start the whole process of reading from memory. And if not — if we’re still in an ambiguous state where there are unresolved stores — we just have to wait. The other thing that the memory order buffer is involved in is all of those horrible locked instructions that you want to do. Fences — store fences and load fences — are operations that will cause those things to drain out before any further instructions that issue loads can happen.

So, finally we get to talk about execution. It just does what the old CPUs used to do. It’s pipelined. There are machine traps. Interestingly, this is where floating-point assists happen — if anyone has ever hit the hideous performance cliff of denormals. To the best of my understanding, what happens here is that the floating-point unit can’t handle numbers that aren’t of the form 1.something × 2^whatever. So, for very, very, very small numbers that are below float min, there has to be special-casing, and the hardware can’t do it — but it can’t determine that ahead of time. So, by the time it gets to this point, the multiplier or the adder goes, “Oh, crap. Right. What do we do? We’ve got a number I can’t deal with.” It essentially has to flush the pipeline, and then it goes back to the microcode sequencer and says, “Can you just generate me the code that can handle this, please?” — dynamically, as a deoptimization kind of thing, like a JIT deopt. The execution ports, we hope that they’re balanced.
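An aside on that denormal cliff, since it bites people in practice: here is a minimal, hedged sketch of both triggering it and opting out of it on x86, using the standard SSE control-register macros from <xmmintrin.h>/<pmmintrin.h> to turn on flush-to-zero and denormals-are-zero. The values and loop counts are made up for illustration, and the timing harness is omitted:

```cpp
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
#include <cstdio>

int main() {
    // Start just below FLT_MIN (~1.18e-38), i.e. already in denormal range.
    // Each multiply that produces or consumes a denormal can trigger a
    // microcode assist on parts like Skylake -- the cliff described above.
    volatile float slow = 1e-38f;
    for (int i = 0; i < 10000; ++i) slow = slow * 0.999f;   // assist-heavy path

    // Opting out: flush results to zero (FTZ) and treat denormal inputs as
    // zero (DAZ). This trades strict IEEE semantics for speed.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    volatile float fast = 1e-38f;
    for (int i = 0; i < 10000; ++i) fast = fast * 0.999f;   // snaps to 0, no assists

    std::printf("slow=%g fast=%g\n", static_cast<double>(slow),
                static_cast<double>(fast));
}
```

On some toolchains, -ffast-math links startup code that sets these MXCSR bits for you, which is one reason fast-math builds don’t hit the cliff.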
The write-back stage is the results are broadcast, they’re written to the physical register file, and it’s marked complete in the reorder buffer. Retirement. All the retirement system is doing is that reorder buffer is essentially that ledger of all instructions that flow through the system. The retirement system is trying to pick up as many completed instructions as it possibly can and say, “Okay, these are now done.” And they can be freed back off. And that’s the other thing, it does the freeing. So, this is where the registers that were renamed are now freed back to the system as they’ve been all the way through the system. But of course, it’s not that the register that just got written to that we’re freeing, it’s whatever it used to be when it was renamed right back at the beginning of the pipeline. So, we know that when we complete the register that replaces EAX completely and it used to be P88 and it’s now P50, at the point that it retires P88 is now free. So, that’s when it gets chained back onto the list. This is where exceptions actually get handled. We talked about this as just a flag and then that’s when the whole thing goes horribly wrong. Hopefully you’re not doing too many exceptions in your code. And not exceptions like C++ exceptions, I mean exceptions like machine checks and divide by zero and memory stuff. The stores are also marked as senior in that memory order buffer because they’re going to continue to live in there until they actually get flushed out to the real memory. And the retire system can do four micro-ops a cycle, which is very rarely a limitation. And then the last stage is just the store buffer retiring stores in order as well. As they become senior, that means they can go out. And then some magic happens, something something total store order, some kind of liaison with the other chips on the die and all the other subsystems. And this is where the RFO, the read for ownership, that actually is the kind of thing when we talk about cache ping-ponging and false sharing, this seems to be when it happens. Well, that’s it. It’s pretty straightforward, right? And to think that I was going to talk to you about branch predictors as well. But the kind of advice is do things simply and straightforwardly. Don’t do divides. Integer divides, that is. Floating point divides fine. Small aligned loops are good. Let the renamer do its work. [Q&A section followed covering topics including: physical register widths vary by type (flags, AVX, general purpose), how researchers like Agner Fog and Travis Downs use performance counters and careful experiments to reverse-engineer these internals, the bit-5 branch predictor discovery that enabled Spectre mitigations, and Intel’s reasons for keeping architecture details secret (Hyrum’s Law, competitive advantage, documentation burden).]