
Transcript

Why Next-Gen AI Scale-Up Needs CPO


TITLE: Why next-gen AI scale-up needs CPO
CHANNEL: SemiAnalysis
DATE: 2026-06-03
---TRANSCRIPT---
Semiconductors run on copper, from the tiniest metal layers that connect individual transistors and the traces that run through your motherboard, all the way to the massive spine that allows 72 GPUs inside a single Nvidia NVL72 rack to communicate with each other. All of them use copper. Without copper, microchips wouldn’t work. But copper has a limit, and a very short one at that. Your internet cable might be copper inside your house or to the curb, but that’s about the maximum if you want really fast internet. Beyond that, it’s all fiber optics and lasers. The same is true for data centers. But here, the demand for ultra-high bandwidth limits the reach of copper even more. The faster data is transmitted over a copper channel, the shorter the distance the transmission can reach. 2 m is about the maximum for speeds of 200 gigabits per second per lane, which is the speed that the latest AI chips communicate at. That’s not a lot of reach. Beyond that distance, copper cannot support the immense bandwidth needs of modern AI servers.
So, why use copper in the first place if it’s so limited? Optical must be better. How can you beat a laser? Well, there’s a reason why they say “use copper when you can and optical when you must.” And that’s exactly what we’ll figure out in this video. We will talk about the advantages and limits of copper, explore the current state of pluggable transceivers, and explain the future of co-packaged optics.
To understand copper versus optics, we have to understand the different networking tiers a modern AI server is using. To start, there’s the front-end network. That’s what every server has been using long before AI became a thing. It’s used for loading data, SSH access, and user requests. Where it gets interesting is the scale-up and scale-out networks. First, there’s the scale-up network.
Scale-up connects all the compute and networking trays inside a single rack. It’s basically rack-internal communication. The scale-up network requires extremely high bandwidth and super low latency because it’s used to connect multiple GPUs or other AI accelerators in such a way that they behave almost like a single GPU. That means they have to be able to communicate with each other in an instant and share extremely large amounts of data. The most famous example of a scale-up network is probably Nvidia’s NVLink inside the NVL72 rack. Scale-up networks are copper-based, and because the networking is limited to a single or maybe a double-wide rack, we’re talking about distances of up to 2 m, well within the domain of copper.
Second, there’s the scale-out network. Scale-out handles the networking that goes outside the rack. It basically covers the entire data center and connects all the individual racks and servers with each other. Scale-out is rack-to-rack or, to be more precise, server-to-server communication. Scale-out isn’t trying to turn multiple chips into a single one, at least not in the extreme way that scale-up does. So, it doesn’t have the same extreme bandwidth and latency requirements, but you still want it to be as fast as possible. And because a modern AI data center contains a lot of racks spread out over a pretty large area, the scale-out network also has to cover a pretty large area. One rack to the next one might still be in reach of copper, but the next row of racks certainly isn’t. That’s why scale-out networks have to be optical.
And just to put the different bandwidth requirements into perspective, the scale-out network needs 8 to 10x more bandwidth than the front end, and the scale-up network uses 10x that of the scale-out again. So, now we know that generally scale-up is rack-internal and copper-based, and scale-out is rack-to-rack using optics. But how does that help us, and where do co-packaged optics come in? I’m so glad you asked.
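
To make the bandwidth ratios quoted above concrete, here is a minimal Python sketch. Only the multipliers (8 to 10x, then 10x again) come from the transcript. The front-end baseline of 1.0 is an arbitrary normalized unit, and the function name `relative_bandwidth` is made up for illustration.

```python
# Relative bandwidth per networking tier, normalized to the front-end network.
# Assumption: the baseline of 1.0 is an arbitrary unit, not a real figure.

def relative_bandwidth(front_end=1.0, scale_out_factor=(8, 10), scale_up_factor=10):
    """Return a (low, high) relative-bandwidth range for each tier."""
    low, high = scale_out_factor
    scale_out = (front_end * low, front_end * high)
    # Scale-up is quoted as 10x the scale-out figure.
    scale_up = (scale_out[0] * scale_up_factor, scale_out[1] * scale_up_factor)
    return {
        "front_end": (front_end, front_end),
        "scale_out": scale_out,
        "scale_up": scale_up,
    }

tiers = relative_bandwidth()
for name, (low, high) in tiers.items():
    print(f"{name:>10}: {low:g}x to {high:g}x the front-end bandwidth")
```

Under these ratios, the scale-up network ends up at roughly 80 to 100 times the bandwidth of the front-end network.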
Optical networking isn’t new by any definition. Whether it’s ultra-long-range fiber cables that literally cross entire oceans or optical networks inside data centers, they’ve been around for a while and are tried-and-tested technology. The most common form of optical interconnect used in data centers is the so-called pluggable transceiver. They are called pluggable because, well, they’re plugged right into the back of a server tray, and they’re called transceivers because they both transmit and receive signals. If you have been inside a data center, there’s a good chance you have seen one of those before. Pluggable transceivers are not only a proven and widely used technology, they are also standardized. The latest and most common form factors are OSFP and QSFP-DD. That’s why they all look the same. And there are a lot of companies competing in the pluggable transceiver space. Lots of different choices.
A standard pluggable transceiver contains four main components. First, the standardized physical connector that plugs electrically into the server tray interface. Second, a DSP, short for digital signal processor. The DSP has the very important function of boosting and cleaning the incoming electrical signal before it’s translated into the outgoing optical signal and vice versa. Component number three is the transmitter, also called the transmitter optical sub-assembly or TOSA, which includes the laser plus the modulation function. And number four is the receiver optical sub-assembly or ROSA, a sensor that catches the incoming light signals.
If you had to guess which of these four components uses the most energy, what would your guess be? The interface, the DSP, the transmitter laser, or the receiver sensor? I’m pretty sure most of you would have answered the same as I would. Obviously, the laser. I mean, it’s a freaking laser, right? But no, the laser only uses about 15% of the typical power of a pluggable transceiver. Only a few percent more than the receiver sensor.
The vast majority of the energy is consumed by the digital signal processor. And by vast majority, I really mean up to 60%. But that’s not all. Every system, every interconnect always adds some kind of latency to a network. A pluggable transceiver usually takes about 150 to 200 nanoseconds to translate the signal from electrical to optical. I’m not going to let you guess again because it’s even more extreme. Over 90% of that added latency comes from the DSP. Looking at all the components inside a pluggable transceiver, the digital signal processor is responsible for up to 60% of the entire energy consumption and 90% or more of the entire added latency.
And that’s where co-packaged optics come in. The entire reason CPO even exists is to eliminate the need for a DSP in an optical transceiver. And how do you do that? By placing the optical engine closer to the source, which means closer to the networking switch or even the GPU. The reason a pluggable transceiver needs a DSP in the first place is that until the signal reaches the transceiver and is translated into an optical signal, it’s an electrical signal traveling over copper. The signal originates from the GPU or the switch, travels through the metal layers of the silicon onto the package, from there to the motherboard, and then finally to the transceiver plugged into the very end of the server tray. We’re talking about up to 30 cm here. And while it doesn’t sound like much, remember that for high-speed interconnects, copper caps out at 2 m or less. So, 30 cm is a lot for copper. Can it handle it? For sure. That’s how pluggable transceivers have worked for many years. But if you want to translate a signal from electrical to optical, you need a super loud and clean signal. 30 cm of copper is enough to degrade a modern high-speed signal to a point where you need a DSP to boost and clean up the signal. And that takes time and energy, hence the power and latency penalty.
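
As a rough illustration of the DSP shares quoted above, the following Python sketch computes what would be left of a pluggable transceiver's latency and power if the DSP's share simply vanished. The shares and the 150 to 200 ns range come from the transcript. The idea that nothing replaces the DSP's contribution is a simplifying assumption, so treat the results as an upper bound on the savings rather than a measured figure. The helper name `without_dsp` is hypothetical.

```python
# Shares quoted in the transcript for a pluggable transceiver.
DSP_POWER_SHARE = 0.60         # "up to 60%" of module power
DSP_LATENCY_SHARE = 0.90       # "over 90%" of the added latency
LATENCY_RANGE_NS = (150, 200)  # electrical-to-optical translation time

def without_dsp(total, dsp_share):
    """What remains of a total once the DSP's share is removed.

    Assumption: a DSP-free design adds nothing in the DSP's place.
    """
    return total * (1.0 - dsp_share)

for total_ns in LATENCY_RANGE_NS:
    rest_ns = without_dsp(total_ns, DSP_LATENCY_SHARE)
    print(f"{total_ns} ns with DSP -> about {rest_ns:.0f} ns without")

# Power is expressed as a fraction of the original module power.
remaining_power = without_dsp(1.0, DSP_POWER_SHARE)
print(f"Remaining power: about {remaining_power:.0%} of the original module")
```

Under these assumptions, the non-DSP part of the latency is on the order of 15 to 20 ns, and the non-DSP part of the power is about 40% of the module total.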
That DSP penalty is what CPO is trying to eliminate. Get so close to the source that you don’t need a DSP anymore. There are different ways to go about it.
One interesting approach is called LPO, or linear pluggable optics. The idea is a mix of crazy and genius. It’s still a standard pluggable transceiver, but without a DSP. If you’re confused right now, I get it. Isn’t a DSP required to clean up the signal? Yes, it is. But LPO basically says, “Screw it.” It takes the still-distorted electrical signal, translates that into an optical signal, and hopes for the best. And you know what? It actually works. But only for a much shorter distance than a typical optical network, because sending an already distorted signal drastically decreases the reach.
The first attempt towards actual CPO was called on-board optics, or OBO. The idea is simple: move the optical transceiver closer to the signal source to reduce DSP requirements. A good start, but it never got enough interest because it wasn’t close enough to get rid of the DSP entirely, and at the same time, it lost the easy access and exchangeability that pluggable transceivers offered. It was a good idea, but ultimately combined the worst of both worlds. More complex integration and less repairability while still needing a DSP.
And then there’s NPO, or near-package optics. NPO moves the optical transceiver even closer to the ASIC, usually on a special high-performance substrate. Much closer than OBO, NPO can be seen as an intermediate step toward CPO and is actually being deployed right now. In the end, the question is, if you are using NPO, why not go all the way? And going all the way means co-packaged optics. It’s already in the name. Co-packaged as in on the same package. That’s super close. AMD’s Infinity Fabric On-Package, for example (again, it’s already in the name), also uses on-package interconnects. That’s how Zen 1 to Zen 5 scale and connect compute and I/O dies. But we can get closer. Let’s talk about CPO tiers.
The first tier is the minimum, what we just talked about. The optical engine is placed on the same package as the switch. Both are connected via copper traces that run over the shared packaging substrate. It’s close enough to get rid of the DSP entirely. But on-package still requires a high-speed interconnect that runs over the package to connect the ASIC with the optical engine, which means you need SerDes that translate the electrical signals from parallel into serial and back again.
The second CPO tier introduces an interposer. This interposer can be silicon-based or organic, and the ASIC and the optical engine sit on the same interposer. They’re still packaged together, but the interconnects aren’t routed via the package substrate, but via the interposer. And because an interposer allows a much higher interconnect density, this design doesn’t require SerDes anymore. ASIC and optical engine are connected via a wide fabric that allows for fully parallel integration. This is the final boss of CPO: placing the optical engine so close to the ASIC that it not only doesn’t require a DSP anymore, but also gets rid of SerDes entirely. Using more advanced packaging technologies like hybrid bonding, for example, could in theory result in an even closer and lower-power integration, no matter if true 3D stacking with the optical engine above or below the ASIC, or 2.5D stacking like TSMC’s latest SOIC-MH. But once the optical engine is so closely packaged that it doesn’t require SerDes anymore, you have truly mastered CPO.
But there’s one more thing. Did you notice how we’ve talked about a networking switch or a GPU or other ASICs? What we’re seeing right now with Nvidia’s Quantum and Spectrum-X chips, for example, is CPO for the networking switch. But the final destination isn’t the switch, it’s the GPU or AI ASIC. Co-packaging the optical engine with the GPU instead of the switch isn’t a different tier, but an entirely different level.
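
The SerDes step mentioned for the first CPO tier, translating parallel data into a serial stream and back, can be sketched in a few lines of Python. This is a toy model only: a real SerDes is a mixed-signal circuit that also handles clocking, encoding, and equalization, none of which is shown here. The 8-bit word width and the function names `serialize` and `deserialize` are assumptions made for illustration.

```python
def serialize(words, width=8):
    """Flatten parallel words into a single bit stream, most significant bit first."""
    bits = []
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError(f"{word} does not fit in {width} bits")
        for i in range(width - 1, -1, -1):
            bits.append((word >> i) & 1)
    return bits

def deserialize(bits, width=8):
    """Rebuild parallel words from a serial bit stream."""
    if len(bits) % width != 0:
        raise ValueError("bit stream length is not a multiple of the word width")
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words

parallel_in = [0xA5, 0x3C]
serial = serialize(parallel_in)    # one narrow, fast lane
parallel_out = deserialize(serial) # back to a wide bus on the far side
print(serial)
print(parallel_out == parallel_in)
```

The second tier described above avoids this round trip: with enough wires on an interposer, the data can stay parallel the whole way.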
Now that we know basically everything about CPO, let’s talk about implementation. At the beginning, we talked about “use copper when you can and optical when you must.” How does that relate to CPO, and where will CPO be used first? The first target of CPO is actually the scale-out network. It’s to replace the pluggable transceivers. And with all the hype around CPO, it seems like an easy choice. But interestingly, it isn’t.
Scale-out networks connect multiple racks in a data center. They have been optical for a while because they cover a distance that’s out of the reach of copper. As I said in the beginning, pluggable transceivers have been the standard for many years now. And we just learned how the only reason CPO exists in the first place is to get rid of the latency-inducing and power-hungry DSPs that are a necessity for pluggable transceivers. But, as always, everything comes with pros and cons. Yes, pluggable transceivers need a DSP. Yes, they use a lot of energy. And yes, they add quite a bit of latency to the network. But they are also super easy to handle. They are literally pluggable. If one fails, you just change it out. Something any data center technician can do in no time. And then, you should never underestimate working with technology you know. Pluggable transceivers have been around for so long that every data center knows how they work and what their flaws are. Plus, because they are standardized and there are many suppliers, large data centers never have to worry about supply constraints or high prices. If your current suppliers get too expensive, another one will gladly step in.
Because of that, moving to CPO isn’t as easy a choice as it might seem. With CPO, you’re buying the optical transceiver as a part of the server hardware. It’s literally packaged right next to the switch or maybe even the GPU itself. Which means if you buy Nvidia hardware, you have to buy the Nvidia solution.
And if you buy Broadcom, you obviously have to buy the Broadcom CPO solution. It also means if one optical interface fails, you have to replace the entire switch. Because you can’t repair something at the packaging level. But, at the same time, you do get an already tested and working system that is more reliable overall.
In the end, it all comes down to pain points. Where are the data center providers and the hyperscalers feeling the most pain? What is most important to them? And when it comes to AI data centers, the largest pain points are energy and system utilization. When you spend billions of dollars on AI hardware, you have to make sure that it doesn’t sit idle. That means you want to reduce latency and increase bandwidth, even for the scale-out network. And because power is such a massive issue for AI data centers, you want to reduce energy consumption for everything that’s not AI compute to the absolute minimum. And these two areas are where CPO shines.
So, AI data centers would seem to benefit a lot from CPO. And in fact, many are considering or already preparing a switch, given that CPO promises better efficiency, reliability, and operational simplicity. Rack-to-rack communication has lower latency, and the networking layers consume less energy. With these obvious benefits, the entire industry should be moving in unison, right? But looking at the industry right now shows an interesting deviation. Many hyperscalers seem to value repairability, and especially vendor diversity, with the important added factor of price control, more than a faster and more efficient scale-out network. Hyperscalers can see the technical benefits, but they also want to avoid vendor lock-in at all costs. That’s why many hyperscalers are still cautious about fully committing to CPO. Because of this, there are even efforts to develop an NPO and pluggable transceiver hybrid.
The entire industry around optical networking is moving quickly, and all possibilities are being explored. NPO could become a real middle ground, or might just turn out to be an intermediate step towards CPO. Neoclouds, on the other hand, are more keen on CPO. They like buying a turnkey solution, and the idea of an Nvidia CPO switch is super appealing to many of them. Either way, because scale-out has always been optical, the rest of the infrastructure is already there. So, scale-out networks in AI data centers are starting to switch to CPO.
But what about scale-up? An entirely different beast, and still fully dominated by copper. But why copper? Because copper speaks the native language of semiconductors. It uses electrical signals. Every microchip works with electrical signals, starting with the smallest layers deep inside the chip itself. By using copper for the rack-internal scale-up network, you don’t have to translate signals at all. Copper has a latency of about 5 ns per meter, and at distances of 2 m or less, we’re talking about maybe 10 ns of latency. Remember, the DSP inside a pluggable transceiver added about 150 to 200 ns of latency alone. Copper is fast because you don’t need to translate the signal. Copper is easy because you don’t need to translate the signal. And copper is tried and tested because it has been used since the very beginning of modern networking.
And because there’s no translation done at all, it’s still faster than co-packaged optics. Yes, CPO gets rid of the DSP, and with that it removes the majority of the latency in legacy pluggable transceivers. But there’s still added latency. And yes, CPO also uses considerably less energy to translate the signals from electrical to optical and back. But even without a DSP, that’s more than what you need for copper. Because with copper, you don’t have to translate the signal at all. For scale-out networks, CPO has obvious advantages, even if there are some downsides.
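
The latency comparison quoted above is simple arithmetic, sketched here in Python. The 5 ns per meter and 150 to 200 ns figures come from the transcript. The sketch compares copper propagation delay against a single electrical-to-optical translation only, and it ignores fiber propagation and the conversion at the far end of an optical link, so it is a simplified, best-case view of the optical side. The function name `copper_latency_ns` is made up for illustration.

```python
COPPER_NS_PER_M = 5.0                       # propagation delay quoted for copper
PLUGGABLE_TRANSLATION_NS = (150.0, 200.0)   # one electrical-to-optical translation

def copper_latency_ns(length_m):
    """Propagation delay over a copper link of the given length in meters."""
    return COPPER_NS_PER_M * length_m

copper_ns = copper_latency_ns(2.0)  # the 2 m reach limit discussed in the video
print(f"2 m of copper: {copper_ns:g} ns")

for translation_ns in PLUGGABLE_TRANSLATION_NS:
    ratio = translation_ns / copper_ns
    print(f"One pluggable translation at {translation_ns:g} ns is {ratio:g}x that")
```

Even in this simplified view, a single translation in a pluggable transceiver costs 15 to 20 times the latency of the full 2 m copper run.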
For scale-up, the case is much more difficult. What are the actual benefits of CPO over copper? The benefits of CPO for scale-up start where copper ends. And I mean that literally. Copper is a great material for high-speed interconnects, but like we said at the beginning, it has a very short limit. And we are getting awfully close to that limit. While current-gen 224G copper interconnects still work with PAM4 and can reach up to 2 m, next-gen 448G interconnects won’t have it that easy. PAM, or pulse amplitude modulation, is a technique that encodes data in different voltage levels; PAM4 uses four levels to transmit two bits per symbol. But for faster interconnect speeds, PAM4 isn’t cutting it anymore. PAM6, or maybe even PAM8, will be needed. The problem is that using higher levels of pulse amplitude modulation creates a more fragile signal, which further reduces the reach of copper. Suddenly, 2 m might be too much. At some point, the signal-to-noise ratio becomes too much of a challenge. Another angle is doubling the baud rate, basically the signaling speed. But in the end, it has the same problem as using higher PAM levels. Both approaches shrink the reach of copper.
The argument for CPO in scale-up networks isn’t really CPO itself, but the limits of copper. And that means as long as copper scales for rack-internal communication, be it with PAM8 or beyond, copper will still be the go-to solution. That’s why Nvidia is still planning copper-based NVLink networks for Rubin, Feynman, and beyond. But copper can’t scale forever. The end is already in sight.
Nvidia’s Blackwell generation was a massive leap for AI performance and efficiency. And while part of that was definitely due to the advanced GPU architecture, the real breakthrough was Nvidia’s NVL72 rack. Before Blackwell, the scale-up domain was limited to eight GPUs. The 9x increase to 72 GPUs was the real performance boon. Suddenly, 72 GPUs could act and work like a single GPU.
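
Going back to the PAM trade-off described earlier in this passage, here is a small Python sketch of why more levels help the data rate but hurt the signal. None of the formulas below are from the video; they are textbook approximations. It assumes ideal coding at log2(levels) bits per symbol (practical PAM6 schemes may carry slightly fewer), a fixed peak-to-peak voltage swing, and no coding overhead, so 448 Gb/s is treated as the raw line rate. All function names are hypothetical.

```python
import math

def bits_per_symbol(levels):
    """Ideal bits carried by one symbol with the given number of voltage levels."""
    return math.log2(levels)

def symbol_rate_gbaud(line_rate_gbps, levels):
    """Symbol rate needed to hit a line rate, ignoring coding overhead."""
    return line_rate_gbps / bits_per_symbol(levels)

def eye_penalty_db(levels):
    """Reduction in spacing between adjacent levels relative to two-level signaling.

    Assumption: the peak-to-peak swing is fixed, so the spacing shrinks by 1/(levels - 1).
    """
    return 20 * math.log10(levels - 1)

for levels in (4, 6, 8):
    print(
        f"PAM{levels}: {bits_per_symbol(levels):.2f} bits/symbol, "
        f"{symbol_rate_gbaud(448, levels):.1f} GBd for 448 Gb/s, "
        f"about {eye_penalty_db(levels):.1f} dB less level spacing than two levels"
    )
```

Under these assumptions, each step up in PAM level lowers the required baud rate but squeezes the voltage levels closer together, which is the signal-to-noise problem the transcript refers to. Keeping PAM4 and doubling the baud rate avoids the tighter spacing but pushes the channel to higher frequencies instead.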
That larger scale-up domain is what unlocked the true performance advantage of Blackwell. And it’s exactly this concept where CPO will be able to show its true advantage. Once CPO is integrated at the GPU level, it will unlock massive scale-up domains that directly translate to a larger world size. And once you compare a CPO-based scale-up cluster with potentially thousands of GPUs to a copper-based scale-up that might be able to connect only a few hundred GPUs, the choice will be obvious.
And that’s exactly what Jensen announced at GTC 2026: a mixed scale-up network that combines copper and optical to achieve a multi-rack world size. Vera Rubin Ultra NVL576 will be the first system using this combined approach, with a total of eight Rubin Ultra NVL72 racks. Internally, each rack still uses a copper-based scale-up network, but this time the scale-up network also connects up to eight NVL72 racks using optics. And because 8 × 72 is 576, Nvidia calls it 576. And the next-gen Nvidia Kyber NVL1152 is already on the horizon. At this point, it doesn’t matter that CPO adds a tiny bit of latency and uses more energy. The sheer scale of CPO scale-up will dominate everything. And because CPO offers a lot more scaling vectors than copper does, there’s plenty of future-proof network scaling left. It might be time to change the principle of “use copper when you can and optics when you must” to “use copper as long as you can,” because the wall is approaching fast.
Co-packaged optics aren’t the holy grail. Just like any other technology, they come with their own drawbacks and challenges. For scale-out networks, the advantages are already clearly visible, and we will see a steady adoption of CPO over the coming years. For scale-up, it’s a little bit more difficult. Copper still has a few tricks up its sleeve, and we will see every last bit pushed out of it. But make no mistake, the end of scaling is in sight, and once copper has reached its literal limit, CPO will take over scale-up in the blink of an eye.
Because no one can outcompete a larger scale-up world size that connects more GPUs.
What we just covered in this video is just a tiny part of the SemiAnalysis deep dive into everything CPO. If you want to understand how CPO really works, I highly recommend checking it out. And for everyone who not only wants to know how CPO works, but also where the entire networking industry is headed, including granular visibility into hardware like switches, transceivers, and cables, plus a top-down analysis of total market conditions and vendor market shares, the SemiAnalysis AI networking model offers industry-leading insights into this fast-paced market. As always, you can find the links in the video description below, and I hope I see you in the next one.