
Transcript

Episode 102: AI Changed the Network: Inside the Ethernet Fabric Powering AI Infrastructure


TITLE: Episode 102: AI Changed the Network: Inside the Ethernet Fabric Powering AI Infrastructure
CHANNEL: Built for Trust
DATE:
URL: https://youtu.be/v9rMYvQaPUY
---TRANSCRIPT---
With the introduction of AI training and distributed inference, the need for bandwidth has skyrocketed.

Hassan, welcome. So great to have you here.

Thank you very much, Nick. Great to be here.

Great, and thanks for helping us geek out this morning. That was a great keynote, and I really enjoyed it, because the whole Ethernet space has not only evolved, it has accelerated at a pace we haven't seen before. We started off with 10 megabits back in 1983, and it wasn't until 2010 that we hit 100 gig. In the last 14 years we have essentially gone from gigabits to terabits. Right now you're shipping chips with 800 gig on them, with significant density at 800 gig as well, and you also have 1.6 and 3.2 terabit in the planning. So let's start with that journey around Ethernet, which has been in our industry for three decades or so, and how it has now accelerated, really thanks to the hyperscalers and the public AI companies as well.

Yeah, Nick, thank you for having us here, and you're absolutely right. If you go back to the data centers of just four or five years ago, you would ask: 800 gig, when is it going to happen? 1.6 terabit, when is it going to happen? And if you look at enterprise data centers, most of them even today are deploying 25-gig and 50-gig NICs. That is the connectivity that you have. But with the introduction of AI training and distributed inference, the need for bandwidth has skyrocketed. It started with the back-end use case, which we call scale-out. You can see the NIC speeds there with the GPUs are 400 gig and 800 gig today, moving to 1.6 terabit. The Tomahawk 6 that we talked about this morning supports 1.6-terabit ports, and it's shipping and in production.
But even more mind-boggling is what is happening in the scale-up domain. Scale-up is where these GPUs directly access each other's memory, and they have an enormous amount of memory bandwidth. We are talking about 40 terabytes of memory bandwidth today, moving to 100 terabytes in the next-generation GPUs. So you're talking about around 10 terabits of networking bandwidth that is required, and 25 terabits tomorrow. Everything is accelerating at an extremely fast pace, and this is driving some of the roadmaps on the Ethernet side. We just released the 100-terabit switch, and 200 terabit and 400 terabit, we're working on it. They're around the corner as we move forward.

Those are amazing numbers. I want to have this conversation around the high end, but then relate it back to the large enterprise marketplace, because we're really talking about the highest end of the marketplace. Networking is really different at 800 gig and in the terabit range. It's not just about connectivity anymore; it's a fabric that's connecting GPU memories together. We have compute and memory so locked into the network, where there's synchronization, load balancing, the ability to support really high burst rates, and really different traffic patterns. So maybe you can describe a little bit how networking and the function of networking have changed now that we're at this scale.
So the role of networking, and where networking comes into play at this scale, depends on the use case. There is a use case for what we call scale-up. This is where you want to connect these GPUs so that, just like you said, they access each other's memory directly. In that space, bandwidth, like I just described, is extremely important. But the network, if you think about it, is actually simpler, because you want these GPUs to be connected just one hop away. That's how you get the best performance for the models, whether you're doing training or inference. So the key thing here is that the address space you need is not very large. You do not need very large Ethernet headers. This is where the concept of optimized headers has come into play. There's discussion around Ultra Ethernet and the consortium on this. You can optimize the header so the address space is small and you're not burdening the fabric.

The other thing that's very important in this space is reliability. Because these are memory transactions, reliability is of extremely high importance. But as you start to scale out, and you're talking about large distributed inference or training, the requirements are slightly different because, like you said, the AI workload is fundamentally very different.
In the traditional data center network we have TCP/IP traffic, and it load-balances itself very effectively across different paths. In the case of AI you have what we call elephant flows: very large flows, and very few of them. Think of this as an eight-lane highway. If everybody is driving in one lane, you're going to have congestion. So how do you solve this load-balancing problem? How do you solve the congestion problem? The other very important problem is how you recover very fast from link failures, because link flaps are a huge deal in any network. If you have a link flap, which does happen, and you're doing training, you're going to go back to a checkpointed state and start the job over, and your job completion time is going to be higher. So these are some of the problems that need to be addressed.

The third problem that networking has to address is that in a lot of these data centers power is a problem. At some point you're going to run out of power, you're going to run out of capacity. If you want to extend these clusters across data centers, how do you do that? How do you keep this fabric lossless? How do you ensure security, because now you're going outside your data center? And how do you go very long distances without compromising on performance?

We do have the speed-of-light problem there. But please, continue.

So those are the three different use cases that you asked about, the ones the industry is trying to address, each with different requirements. The good thing is that all of this can be done over Ethernet. It is the only technology that allows you to do scale-up, scale-out, and scale-across end to end.

Yeah, I'm with you there.
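
The eight-lane-highway picture can be made concrete with a small simulation. The sketch below is illustrative only: the flow names, the packet counts, and the use of CRC32 as the hash are assumptions chosen for the example, not a description of how any particular switch works. It compares pinning each flow to one link by hash (the classic per-flow approach) with spraying packets across all links.

```python
import zlib

NUM_LINKS = 8  # the "eight-lane highway"; assumed value for illustration


def per_flow_load(flows, num_links=NUM_LINKS):
    """Pin every flow to one link chosen by a hash of its identifier."""
    load = [0] * num_links
    for flow_id, packets in flows.items():
        link = zlib.crc32(flow_id.encode()) % num_links
        load[link] += packets
    return load


def sprayed_load(flows, num_links=NUM_LINKS):
    """Spread packets round-robin over all links, ignoring flow boundaries."""
    load = [0] * num_links
    next_link = 0
    for packets in flows.values():
        for _ in range(packets):
            load[next_link] += 1
            next_link = (next_link + 1) % num_links
    return load


# Hypothetical elephant flows: only four of them, each very large.
elephants = {f"gpu{i}->gpu{i + 8}": 1000 for i in range(4)}

pinned = per_flow_load(elephants)
sprayed = sprayed_load(elephants)
print("per-flow hashing:", pinned)
print("packet spraying: ", sprayed)
```

With four flows on eight links, per-flow hashing can use at most four links, and a hash collision stacks two flows onto the same one. Spraying keeps every link within one packet of the others, at the cost of delivering packets out of order, so the receiver has to tolerate reordering.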
And obviously there's InfiniBand, which has been serving really well in these large environments too. But there's a much larger community around the Ethernet space, and a large culture of innovation there as well, able to grow it as fast as the market really needs. I think what has also occurred, as these speeds increase and as chips like the ones you have right in front of us get developed, is that the optics market has just exploded. So where's your thinking on where the E-O conversion happens? It used to be in the modules that you plug in, but it's getting closer and closer to the interface on the chips. Can we talk a little bit about where the conversion from electrons to photons, and photons back to electrons, occurs?

Very good question. You really can't talk about networking without optics today. They are tied at the hip, and these conversations have to happen together. I'll again go back to the use cases that I talked about, because the optics requirements are evolving in all of these different use cases. Let's start with scale-up once again. Scale-up is really confined to a rack today. This is where all of the XPUs or GPUs are connected in a rack, and predominantly they are connected over copper. Copper has been fantastic: it has low power, it's cheap, and it is very reliable. But to get the best performance for inference or for training, people want to increase this domain size. Right now you're confined to fewer than 100; you want to go to domain sizes of 256, 512, maybe 1,000 in the next two to three years.
But if you want to do that, at some point you don't have the reach with copper, because of the SerDes: 100 gig, 200 gig, 400 gig. As the SerDes speeds increase, the reach for copper is going to go down.

Yeah, the distance shrinks.

So the question really is how you solve this problem. This is where we worked together with the industry, with Meta, Microsoft, OpenAI, Nvidia, and AMD, and released an MSA called the OCI MSA. It is really an open, interoperable physical-layer specification for optical scale-up. Essentially, the OCI technology will give you a reach of up to 2 km without compromising on the benefits that you have with copper, which are power, cost, and reliability.

And 2 km is big enough for a data center.

It's big enough for these scale-up domains, absolutely, as we look over the next three to five years. Now, you talked about photonics. On the scale-out side of things, we talked about this Tomahawk-based system, with 128 ports of 800 gig or 64 ports of 1.6 terabit, and this is going to go to 200 and 400. The systems are becoming increasingly complex. They are very difficult to cool; the thermals are a challenge. That's where the importance of optics comes in. We have been investing at Broadcom for the last five years in this concept of co-packaged optics, and we are on a third generation. We released this thing called Tomahawk 6 Davisson. It's the 100-terabit Tomahawk 6, but it also has 100 terabits' worth of optical engines in there. The lasers will be outside of the chip; they'll be on the front panel. The power benefits, I think, are well understood in the industry. There are others who are working on CPO solutions.
You get 70% power savings when you go to CPO. But the big question has always been around reliability, because it's a big risk. Now you're integrating optical engines onto the silicon itself. What if a failure happens? Are you going to replace the entire switch? This is where we've worked with a lot of partners on very extensive testing. Meta has been at the forefront of this. They did this testing for co-packaged optics with us, and they were able to get to 1 million link-flap-free device hours.

And the link flaps are that important because, if you have a link flap, you're restarting that job from the very beginning.

Exactly.

So go ahead, continue.

And now they're able to get up to 50 million link-flap-free device hours. So you're getting reliability which is almost 10x what you can get with pluggable optics. Co-packaged optics is absolutely starting to happen in the industry, and it's becoming the need. But there are other initiatives out there. Arista came up with this concept of XPO, where you can have different optical choices and fit 200 terabit in one RU. So I think what you'll see is a lot of work happening in conjunction between networking and optics, a lot of this integration, and a lot of customers will not want to have a networking discussion without an optics discussion.

I think so. So, can we take a look at the chipsets that you brought?

Absolutely. I brought a few things with me. This is what we call the Tomahawk Ultra.
You talked about InfiniBand, and InfiniBand does a very good job in terms of low latency. For Ethernet, one of the key things that we wanted to solve was how you get very low latency at very high throughput, and this is what this solved. It gives you 51.2 terabits of performance at 250 nanoseconds of latency, and 77 billion packets per second. Great for scale-up. And this is the Tomahawk 6. This is the only 100-terabit switch in the industry that's in production.

So how many ports is that at 100 terabit?

You can get 128 ports of 800 gig or 64 ports of 1.6 terabit. On the show floor you can see that some of the vendors already have systems built with Tomahawk 6. This is running in production today and shipping in production volume, and you can build a 128,000-XPU cluster in a two-tier topology with this.

So which OEMs use that chipset?

What you'll find, Nick, is that for enterprises a lot of customers use Arista, Juniper, Nokia; all of them will have Tomahawk 6-based systems. And there is an ODM ecosystem: you saw, for example, Celestica, Edgecore, Micas Networks, and all of them have systems based on Tomahawk 6 that are available today. So there's a very large ecosystem building on top of it. And then, last but not least, this is the Jericho 4, or a variation of it that we call Qumran4D. You can see there is high-bandwidth memory on here. The reason we have this high-bandwidth memory is that when you're scaling across and going very long distances, you need the buffers.

Oh, so that's the buffer allocation right there.

This is essentially a 51.2-terabit router. You can get 64 ports of 800 gig in a 2 RU form factor.

How many ports of 800 gig?

64 ports.
It's a complete router with very large routing tables and deep buffers. It gives you line-rate encryption, IPsec or MACsec, and you can go ZR-optics distances, which are 100 kilometers or above, with this. And this small guy over here is what we call the Thor Ultra. This is the world's first 800-gig NIC. We talked about the NICs moving to 400 gig and 800 gig; this is the 800-gig NIC. It gives you 800 gig of performance, but it also has the capabilities that were discussed in the Ultra Ethernet Consortium to modernize RDMA as a protocol: multipathing, out-of-order data placement, selective retransmit, and modern congestion control. This will be shipping in production in July of this year.

Okay, awesome. So if I'm looking at these chips here, the optics would be connected out this part, basically all around the chipset. So you're not bringing the optics into the chip. It's really on the board, and that's where the LEDs and the receivers are as well.

Correct. There are two ways. With co-packaged optics, you can actually integrate the optics onto the Tomahawk 6. There will be a different package that you have.

What do you mean when you say it will be integrated right onto it? On the chip?

Yeah, they will be on the sides here. But the lasers, like I said before, are going to be outside. When you're building the systems, the lasers are going to be on the front panel, because lasers are the one thing that has a higher probability of failure, so they remain outside.

Really? I didn't know that. So the lasers actually have a higher probability of failure than the silicon.

Correct.

Wow. Okay, go figure.
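
Returning to the point about deep buffers for long distances: the need follows from the bandwidth-delay product, since whatever is in flight on the link is what a lossless fabric may have to absorb. The sketch below is back-of-the-envelope arithmetic, assuming light travels through fiber at roughly 200,000 km/s (about two thirds of its speed in vacuum) and sizing one round trip of data for a single port. Only the 800-gig port speed and the 100 km distance come from the conversation; real buffer sizing depends on flow control and traffic mix.

```python
FIBER_KM_PER_SECOND = 200_000  # assumed: ~2/3 of the speed of light in vacuum


def one_way_delay_us(distance_km):
    """Propagation delay over fiber, in microseconds."""
    return distance_km / FIBER_KM_PER_SECOND * 1e6


def bandwidth_delay_product_bytes(port_gbps, distance_km):
    """Bytes in flight on one port during one round trip."""
    round_trip_s = 2 * distance_km / FIBER_KM_PER_SECOND
    bits_in_flight = port_gbps * 1e9 * round_trip_s
    return bits_in_flight / 8


delay = one_way_delay_us(100)
in_flight = bandwidth_delay_product_bytes(800, 100)
print(f"one-way delay over 100 km: {delay:.0f} us")
print(f"in flight per 800G port:   {in_flight / 1e6:.0f} MB")
```

Under these assumptions a single 800-gig port over 100 km keeps on the order of 100 MB in flight per round trip, far more than on-chip packet buffers typically hold, which is consistent with the router carrying high-bandwidth memory.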
And I'm assuming everything connecting into these systems is mostly single-mode fiber, so that you can get the data rates up that high.

Correct. It depends, again, on the use case. You have DR4 and DR8, people are using FR, and as you go longer distances you're using ZR kind of optics. When you are just connecting scale-up, it's copper, but some people will use AEC. It just depends on the reach that you're looking for, and you have different flavors available.

Okay. So, a little Overton-window-expansion question here. If this stuff was going to be in an AI data center in space, how would we connect those GPUs? Is that going to be direct laser to laser through the vacuum of space, or is that actually going to be switches?

Well, if you're building this data center, the question is the connectivity. You'll have these racks, so you will have the scale-up domains and probably the scale-out domains.

Same thing, huh?

Same thing. That's how you'll achieve the connectivity. Now, between space and Earth, how does it work? I think there are some things that need to be figured out.

That's right. Getting all that data from space back down. Well, this is really great. Thanks so much for bringing the props. I love looking at hardware, and those are super impressive. Now I want to lead us a little bit into the enterprise marketplace. Obviously most of the action for AI has been the hyperscalers, OpenAI, and Anthropic (well, Anthropic has been kind of renting compute), and Meta and others.
Tom Gillis had a really good presentation right before yours, and he was talking about how the ecosystem around the enterprise marketplace is starting to work on smaller, more compact AI stacks for the large enterprise: much more power-efficient, not as dense around the GPUs. So how do you fit into that market segment? Are you working with the other OEMs on how to package or repackage this technology so it's more accessible and affordable for the large enterprise?

Nick, I think the way that we build this is that we have these small building blocks, and depending on the scale that you want to achieve, you can scale accordingly. You're absolutely right: most enterprises probably don't need cluster sizes of 100,000 XPUs. What you probably need is maybe a few hundred, maybe just a few. But even with these products, if you're using a Tomahawk 5 or a Tomahawk 6, you don't need hundreds or thousands of these switches. A single switch can support your entire domain. As we talked about, you can get 128 ports of 800 gig. If you use a Tomahawk 5, you can get 128 ports of 400 gig. So your entire fabric could be a single switch. These systems used to be 14 to 16 RU, they consumed 20,000 W of power, and they cost a million dollars. Something like a Tomahawk has now shrunk that down to 2 RU: one-seventh the form factor, eight times the performance, one-tenth the power, and less than one-tenth the cost.

Congratulations on those metrics. That's an impressive set of metrics.

Absolutely.
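
The port counts and cluster size quoted in the conversation can be cross-checked with simple arithmetic. This sketch assumes a 102.4 Tb/s switch (what "128 ports of 800 gig" implies for the "100 terabit" part), a 51.2 Tb/s switch for the previous generation (what "128 ports of 400 gig" implies), a non-blocking two-tier leaf-spine design in which each leaf splits its ports evenly between endpoints and spines, and 200-gig links per XPU. The 200-gig per-XPU speed and the topology details are assumptions made here to show how a figure near 128,000 could arise; they were not stated in the conversation.

```python
TOMAHAWK6_GBPS = 102_400  # assumed: 128 ports x 800 Gb/s
TOMAHAWK5_GBPS = 51_200   # assumed: 128 ports x 400 Gb/s


def ports_at_speed(switch_gbps, port_gbps):
    """How many ports of a given speed fit in the switch's total bandwidth."""
    return switch_gbps // port_gbps


def two_tier_max_endpoints(radix):
    """Endpoints in a non-blocking two-tier leaf-spine built from one switch type.

    Each leaf uses half its ports for endpoints and half for spine uplinks;
    each spine port connects to a different leaf, so there can be `radix` leaves.
    """
    downlinks_per_leaf = radix // 2
    max_leaves = radix
    return max_leaves * downlinks_per_leaf


for speed in (1600, 800, 400, 200):
    print(f"{speed:>5} Gb/s ports: {ports_at_speed(TOMAHAWK6_GBPS, speed)}")

radix_200g = ports_at_speed(TOMAHAWK6_GBPS, 200)
print("two-tier endpoints at 200G:", two_tier_max_endpoints(radix_200g))
```

The same helper also shows the enterprise case: if a few hundred XPUs or fewer fit within one switch's port count at the chosen speed, the whole fabric can be a single device with no second tier.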
There are enterprises who are doing this today. For them the entire fabric would be a couple of switches as they start off, and as the requirements grow you can continue to scale out and grow bigger with these form factors.

Awesome. Hassan, thank you so very much. It's always such a pleasure to have you, and to get into your world and see what you've been building and working on over at Broadcom, because you are the engine that drives a lot of the products that we see out in the marketplace. Thanks for coming on.

Thank you for having me on the show, and congratulations. I see the show continues to grow. I think over the next few years this AI revolution is going to hit the enterprise even bigger, and we look forward to working with you moving forward.

Awesome. Great. Thanks, Hassan.

Thank you.

Thanks, everyone.