TITLE: What Happens After A 1,000,000x AI Compute Leap? | Jeff Dean
CHANNEL: Two Minute Papers
DATE: 2026-06-01
---TRANSCRIPT---
There used to be a chat group internally called data centers on fire that would have like exciting [laughter] uh exciting events happening.

A distant supernova goes off. A cosmic ray hits a memory cell and a zero flips to a one. Does that really happen? Oh, yeah. So, my question is do you enjoy these Chuck Norris-style jokes about you? It could be true. Um [laughter] One problem that you solved, tried to solve many times, but have never been able to crack. I cannot believe that this is happening, but I got to talk to a legendary engineer, the chief scientist of Google, Jeff Dean. He led Google Brain, one of the most legendary AI labs in history. He co-created MapReduce, which taught thousands of computers to work together as one. He co-built TensorFlow, the engine behind a huge chunk of AI research. And for all this, they call him the Chuck Norris of computer science. Yes, I will tell him a joke about that, too. Now, when I see interviews with these executives, everyone is asking about China and taxes and all that. Look, I know nothing about that. I am just a student who loves to talk about research. So, my goal was to try to go a bit deeper and ask him questions that maybe only he knows the answer to, which is incredible. I’ll also ask him about problems that even he couldn’t solve yet. And I will ask him about some of the secret sauce at Google and see if we get something. And more. And I am so happy to share it with you, fellow scholars, so we can learn together. I am not sure if I saw Jeff smile and laugh this much before, so I hope he enjoyed it, too. And once again, this is an incredible honor. I cannot believe that I was sitting there. There were some production issues with the video part. I apologize for those. Also, I was super nervous. I could barely hold on to my papers. Now, fellow scholars, let’s learn together with Jeff Dean. Thank you so much for doing this, Jeff. We talked a bit last year and I learned so much from you. It was incredible. And then I got a message that we we get to do this and I was so happy. 
So, thank you so much for this and and we get to share your knowledge a small part of your knowledge with with the fellow scholars. So, that’s that’s absolutely great chatting with you last year and I’m looking forward to this. Thank you. Thank you. So, everyone says that we are running out of training data for LLMs, but you you said that there is still plenty of data out there. What did you mean? Yeah, I mean, I think everyone has this view that we’re running out of training data and um it’s true we’ve like used quite a lot of of the public text data in the world. Um but I think there’s lots of interesting video data that we’re not really training on yet. Uh there’s lots of interesting kind of um ways to generate synthetic data and then use that for training. Mhm. And then I also think we can start doing things like uh making more passes over the data that we do have to make more and more capable models and also come up with algorithmic techniques that enable us to get a lot more information from every piece of data that we do have. So, I I’m not too worried about that as like an impediment to making progress. It seems like there’s lots and lots of things we can do. People also say that with so much simulation data as as you mentioned, sooner or later most of the data will be AI-generated. Which is then used to train a different AI and then suddenly everyone starts to, you know, learn on the same thing. But you said, wait. It still helps. I think the argument was that if you have enough compute, you can crunch through a lot of data. And if there is just a little needle in the haystack that’s useful, the system is able to learn from it. Is that true? Because my previous crappy little experiments, it it was not true at all. So, you had to be very careful with the data. Yeah, I mean, I think it is true in general. I mean, if there’s a lot of details to get right to make this a reality. 
Think about, for example, doing RL training and rollouts to, you know, figure out how to solve some fairly high-level phrased coding question. Right? So, you might explore a hundred or a thousand different ways of generating solutions to these problems, and you might have some, you know, some filters that you apply to these things like, does the code even compile? Well, you can throw out 800 of them right right off the bat. Does it pass the unit test? Does it like perform well? And so, you can really start to hone in on like which of these, you know, potentially many solutions to the problem is the one that actually sort of generates the highest, you know, characteristics that you’re looking for, the reward in some [clears throat] sense. And that I think is is definitely true. Like, more compute will generate you more interesting solutions, and then those can then be put into the training data. They can be enriched with like data augmentation techniques. You know, I generated the solution in Python. Now, I could generate a solution in Go, and have more Go programming language training data. That’s like a an incredible kind of augmentation. Like, augmentation before with convolutional neural networks, you know, it was just just shift the image by a couple pixels and whatnot. And here, the augmentation can be like completely different programming language and and whatnot. Yeah, I mean, I think, you know, a lot of times we think about coding-based problems as you go from natural language, which is often very under-specified. Like, you know, make me a cool Space Invader game or something. Um, but actually, if you have a program that already works, that does what you want, and you want to translate it, that’s awesome. Cuz in effect your prompt is the fully specified behavior of the system you want and you just want it in a different language for whatever reason maybe better performance or better safety characteristics or whatever. 
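The filtering funnel described above (does it compile, does it pass the unit tests, which survivor scores best) can be sketched in a few lines. This is a minimal illustration, not the pipeline used at Google: the hard-coded candidates stand in for model rollouts, and `compiles`, `passes_tests`, `reward`, and `best_candidate` are hypothetical names. The reward here is simply "shorter source is better", an assumption made only to keep the example small.

```python
from typing import Callable, List, Optional

# Hypothetical "rollouts": source strings that are supposed to define add(a, b).
CANDIDATES = [
    "def add(a, b) return a + b",              # syntax error, fails the compile filter
    "def add(a, b):\n    return a - b",        # compiles, fails the unit tests
    "def add(a, b):\n    return sum([a, b])",  # passes
    "def add(a, b):\n    return a + b",        # passes and is shorter
]

# Unit tests as (arguments, expected result) pairs.
TESTS = [((1, 2), 3), ((0, 0), 0), ((-4, 9), 5)]


def compiles(source: str) -> bool:
    """First filter: does the candidate even parse?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def load_add(source: str) -> Optional[Callable]:
    """Execute the candidate and return its add function, if it defines one."""
    namespace: dict = {}
    # exec is acceptable here only because the candidates are hard-coded;
    # real rollouts would need a sandbox.
    exec(source, namespace)
    return namespace.get("add")


def passes_tests(source: str) -> bool:
    """Second filter: does the candidate pass every unit test?"""
    fn = load_add(source)
    if fn is None:
        return False
    try:
        return all(fn(*args) == expected for args, expected in TESTS)
    except Exception:
        return False


def reward(source: str) -> float:
    """Stand-in reward (an assumption): shorter source scores higher."""
    return -float(len(source))


def best_candidate(candidates: List[str]) -> Optional[str]:
    """Apply the filters in order, then keep the highest-reward survivor."""
    survivors = [c for c in candidates if compiles(c)]
    survivors = [c for c in survivors if passes_tests(c)]
    return max(survivors, key=reward) if survivors else None
```

Calling `best_candidate(CANDIDATES)` returns the plain `a + b` version. In the setting described in the interview, the surviving solutions would then be candidates for the training data.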
So, we’ve seen that internally with some tools that have been written in Python, and people have been able to sort of just say, “Please use all the tests for this code and the actual Python code base and make different versions of it,” and found, you know, much faster solutions. So, you can suddenly get so much more out of the same amount of data, basically. Yeah. So, that’s why you’re not worried about the data. [clears throat] Okay, nice. Now, Bill Dally has said that something like 90% of what happens in modern data centers is not training anymore, which I found really surprising. It’s inference. Like, there’s less training and more using, relatively speaking. Um, how does that shift the way you design hardware at Google? Yeah, I mean, first, there’s a lot of other things that are neither inference nor training happening in data centers, like all the applications we run, and search and Gmail and so on. But of the sort of machine learning workloads, you know, it is the case that training uh is becoming, you know, a smaller proportion of the overall compute that we want to do because [clears throat] there’s so much, you know, inference workload you want to do. And the inference workload includes both, like, offline inference, uh sort of RL rollouts during RL training, uh and then also online inference for handling user requests or agent-based behavior. Because of that shift and the different characteristics of those two kinds of computations, it makes a ton more sense to now specialize much more for inference workloads in hardware, for example, because the characteristics are quite different. You need lower precision; you are, you know, handling a very large volume of requests on this particular model. The model weights don’t necessarily change uh at inference time. All these things lead to very different solutions for hardware, and much more energy efficiency can be gained by specializing.
And so I think you’ll see a lot more in this area now and in the future. We’ve already done this with our TPU 8i and 8t chips that we announced a couple maybe a month ago. But you’ll see even more specialization I think. And that’s pretty crazy that you said that even FP4 kind of works. And I when I first heard it I was like it cannot possibly work and do anything useful. And and it does. Yeah, if you told that to a computer scientist from 15 years ago they’d be like Yeah, that’s that’s that’s not not enough numbers. Yeah, yeah, exactly. And I look at every now and then at these papers and you you you have these these different transforms that are the the the distance preserving transforms, rotations between the points and all kinds of compression but still FP4 that’s unbelievable. It’s not many bits for exponent. Yeah, and it’s a good sign that it works. Yeah, yeah. I I don’t know if we can get lower. What what do you think like even lower? Possible. I mean I think you know people are seeing and experimenting with things where you have some even lower precision and then every so many weights of that you know lower precision you have a scaling factor. And that seems like you get a little bit of a higher precision thing that kind of shared across all the other lower lower bit precision formats whatever they might be two-bit integer, one-bit integer, you know, I haven’t heard anyone say two-bit float cuz I’m not sure what that would mean. [laughter] But yeah, I think that plus a scaling factor seems to be able to get you pretty far. And the question is you like how often do you need the scaling factor? Is it every 64 or 128 or 256 weights? Pre and post-training are typically separate steps today. Do you see that split holding or do you expect the two to merge as as capabilities increase? Yeah, I mean I feel like it’s a little intellectually dissatisfying that they are these distinct phases and you do one and then you do the other. 
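The "low-precision weights plus a shared scaling factor every so many weights" idea mentioned above can be sketched as block-wise quantization. This is a generic illustration under stated assumptions, not a description of any TPU format: it uses symmetric signed integer codes rather than a floating-point format, a 16-bit scale per block is assumed for the storage estimate, and the function names are hypothetical.

```python
def quantize_blocks(weights, block_size=64, bits=4):
    """Split weights into blocks; store one float scale plus small integer codes per block."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit symmetric codes
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # The shared scale maps the largest weight in the block onto the largest code.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blocks(blocks):
    """Recover approximate weights: each code is multiplied by its block's scale."""
    return [scale * code for scale, codes in blocks for code in codes]


def bits_per_weight(bits=4, scale_bits=16, block_size=64):
    """Effective storage cost per weight once the shared scale is amortized."""
    return bits + scale_bits / block_size
```

Under these assumptions the rounding error per weight is at most half the block's scale. The question raised in the interview about how often a scaling factor is needed shows up directly in `bits_per_weight`: with 4-bit codes and an assumed 16-bit scale, a block of 64 costs 4.25 bits per weight, 128 costs 4.125, and 256 costs 4.0625. Larger blocks are cheaper to store, but one outlier weight then stretches the scale for more of its neighbors.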
Conceptually, the right uh thing to do is to have interleaved periods where you’re sort of observing data and then periods where you’re trying to use that new knowledge you’ve gotten from the data you have. With DQN, this experience replay kind of thing. Yeah, and then you want to now take actions in some environment. Maybe it’s a simulated environment, maybe it’s the world with a robot or whatever it is, and then, you know, learn from those actions, because I think you get a lot more uh benefit from actually um taking actions and observing the consequences, or trying to write code and seeing does the code work, than you do from just passively sitting there and seeing tokens stream by you, which is really what most of pre-training is. It’s really interesting that you say that, in an interleaved manner, because when I hear merging the two, what’s in my mind is continuous. Like continuous learning. But at the same time, people have to test models. You cannot just chuck it out there, you know. You finish training, you finish the post-training, and then maybe the red teaming steps and, you know, safety and everything, and then you package it up and you say, “Okay, this is good to go.” But if there is continuous learning, then there are new challenges, because how do you know that this intermediate state is actually safe? Maybe some more research there too. Yeah, I mean, I think uh first, a bunch of discrete steps, where maybe you do this a hundred times or a thousand times, starts to look more like an integral than a summation. Mhm. Um, and so I do think interleaving in that way will make sense. But you’re right. Like, you have a bunch of things you need to do for a live model that is serving user requests. You need to make sure that it’s safe. Um, so maybe the continual learning happens, and then there’s some uh application of, you know, safety protocols and red teaming, as you say.
Uh and then you release a new version of that, but then that model still continues to learn kind of behind the scenes. And then before the newest version of it is provided to users, you redo the sort of final safety testing and red teaming. Jensen likes to say that compute capabilities advanced 1 million X over the last 10 years. So, assuming we get another 1 million X in the next 10 years, what would we be able to do that we cannot do now? Yeah, I mean, imagining the future is always a hard thing because this field is moving quickly. I mean, I think if you think back, you know, 10... Was it 10 years? 10 years. 10 years. You think back 10 years, you know, we were kind of just starting to have language models. The sequence-to-sequence paper had appeared; you know, uh it was just before the transformer. LSTMs maybe. LSTMs were popular. Um, and now those models sort of look uh ancient and not nearly as capable as the models we have today. So, I think if you project forward that level of advancement, you’re going to see huge investments in both, like, new kinds of hardware, um, you know, new kinds of research techniques. Uh there’s just a lot more attention being paid to the field. So, I see that progress rate not slowing down um over the next 10 years. And so, that’s going to be incredible. Like, the multi-agent workflows we’re now able to start to kind of get to work on very complicated tasks, like you saw in the I/O uh keynote, being able to write an operating system autonomously with a relatively simple prompt. Crazy. Uh you know, obviously there’s a lot of operating system-y things in the training data, so it’s not completely out of distribution, but, you know, the fact that it’s able to build an OS that can run Doom uh successfully is pretty amazing. I couldn’t believe it. I mean, last year I heard a talk from uh Stephen Balaban, the Lambda CEO. Mhm. And he had this neural OS. Mhm.
Like, hey, you know, it it does more and more like like forget the UI, forget forget the maybe the drivers, I don’t know. Right. But but just let’s let’s have a neural OS. And I was like, yeah, that that sounds like an amazing science fiction idea. I would love to see it, but I don’t know. I mean, it sounds far off. A year later and and we got Yeah. I mean, not not exactly like that, I know. But but if if you look at the derivatives Yeah, I mean, I would say one thing I’m particularly excited about is you know, can we with these tools accomplish so much more in, you know, science, Demis was mentioning in the keynote, or in, you know, complicated engineering tasks that often would take you know, lots and lots of people multiple years to accomplish, could you actually have a system that with the correct access to the right kinds of simulation environments and a learning set of agents that are trying to accomplish the task and break it down into smaller tasks, could you design an airplane in, you know, 5 days instead of, you know, many many years? That would be amazing. I would One one million X and we we can we can try again. I mean, we’re not there yet, but that would be a pretty pretty amazing capability or designing new new computer chips or computer systems, new hardware. Um, you know, I’m pretty excited about that. Yeah, incredible times. Are open models standing on the shoulders of giants? And by that I mean, if if frontier models suddenly stopped being released, would open models improve as quickly as they do now, or is their progress mostly driven by distillation? Yeah, I mean, I think certainly a bunch of the progress is driven by distillation. For example, our own Gemma models are definitely distilled from higher quality larger scale models. And I think a lot of other open source models are getting benefit from distillation data. Distillation has always been a you know, amazing way to get really capable models into a smaller footprint thing. 
And you know, that’s how our flash models are quite capable for their size relative to the pro models as we’re able to use the pro model to to teach the the flash models. So, I mean, I think really the the question is uh not so much one of closed versus open. It’s, you know, if we want small incredibly capable models, we have to keep building larger scale models that are maybe less inference efficient, but are more capable. And then use distillation to you know, transfer the knowledge into into the smaller models, whether they are open or closed. Now, I’m I’m wondering, you might be the only one who can answer that. So, I I’d really want to ask this. Everyone has their their flagship models and yes, the distilled models like pretty much every company does this tiered level thing. The quicker faster models are always were well below the the frontier models. And at some point, I think 3.1 or there was one version where where the the quick one was suddenly so so close to the frontier one. There was like a 3% difference Mhm. in in [clears throat] in tough benchmarks. Yeah. And and I just heard someone saying, I don’t even know who that was that that yeah, it’s not like just distillation. There is some magic sauce in there that’s been in the works for years. So, can I hear a bit about that? Sure. Well, not too much. I mean, there is always a magic sauce that we don’t reveal, but distillation is definitely one of the key things that makes those you know, much smaller models, much cheaper, much faster, much more affordable models be, you know, nearly as good as those frontier models. And then we push ahead and build um even better frontier model, and then we have to then do the process again where we now sort of transfer the the knowledge in the really capable frontier model it back into a a lighter weight one. 
And I think um you know, this is this is really important because the flash models are really the workhorse of what people generally want to use cuz they’re you know, they’re almost as capable and they’re super fast. Yeah. Yeah, and uh and they’re they’re quite good. Yeah, it’s unbelievable how close they can get. Like this this didn’t used to be like that at all. All right, what trends in machine learning are you most excited about right now? You you have a separate talk about like exciting trends in machine learning or Oh, yeah. Yeah, I mean I think What’s what’s the newer version of that? Yeah, the newer version I guess uh I mean there’s a few different trends that I think are really exciting. So one is um uh So first, I think continual learning is still a little bit nascent, but I think looking at ways to make models that are more interleaved in their way use of so they’re sort of seeing data passively and taking action and learning from that seems like a really important thing. Uh you know, agents and multi-agent use of uh these systems is really really important. Um as one trend of that though, I think as you see uh you know, we’re going to need a lot more inference hardware and capability for that because those systems that are working autonomously in the background actually consume lots of tokens in order to sort of do the the kind of important work they’ve been asked to do. Um you know, I think uh being able to build really efficient inference hardware will enable a lot of of things. So looking at, you know, co-design of model architectures and hardware architectures to make sort of the best use of um things and have really good properties in terms of very low latency, you know, much higher performance per watt, performance per dollar are things we we really care about. 
Um, you know, I think looking at how do you, you know, the context window of these models is an important characteristic, but uh, I think there’s a lot we could do if we come up with mechanisms that are sort of cascaded series of things that kind of give you the illusion that you have all information in the context window. Like you’d like to have the whole internet at your model’s fingertips. Or on a personal level, if you’ve opted in, you know, all of your email and your photos and your the videos you’ve watched and things like that. Um, but you can’t really do it with the sort of quadratic attention mechanism, but I think you can build a series of kind of retrieval and lighter weight mechanisms and then ways of cascading from, you know, here are the 30,000 documents out of 10 billion that seem most relevant and then, you know, have a lighter weight model that looks at those and decides these 117 things seem really relevant to what you’re trying to do and puts those in the sort of more expensive context window of a a bigger model, perhaps. Uh, that’s going to be kind of exciting and how do you orchestrate and interleave all that stuff so it gives you the illusion, uh, without you having to even think about it. Interesting. So, it’s very advanced games to be played with the context window because obviously very expensive. So, the attention mechanism you got you got big O of n squared. Uh, are we still there or are do we have some I mean, I’ve heard some n log n things. Can we go lower there’s like a whole series Obviously we can go lower, but the question is what what the trade-offs are, right? Like what what do you have to pay for that? Um, where are we in that? Yeah, I mean, I think there’s actually quite a large body of work there, probably, you know, 100 papers on more efficient context, uh, algorithms than than the the n squared one. I mean the N squared one works really well. 
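The cascade described earlier in this answer (a cheap pass narrows a huge corpus, a lighter-weight model narrows it again, and only the few survivors enter the expensive context window) can be sketched as a two-stage ranking. This is a toy illustration only: both scoring functions are crude word-overlap stand-ins for a real retrieval index and a real lighter-weight model, and every name here is hypothetical.

```python
def cheap_score(query, doc):
    """Stage 1 stand-in: number of distinct query words that appear in the document."""
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split()))


def finer_score(query, doc):
    """Stage 2 stand-in for a lighter-weight model: fraction of document words that match."""
    query_words = set(query.lower().split())
    words = doc.lower().split()
    hits = sum(1 for w in words if w in query_words)
    return hits / len(words) if words else 0.0


def cascade(query, corpus, k_coarse, k_fine):
    """Keep k_coarse documents by the cheap score, then k_fine of those by the finer score."""
    coarse = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k_coarse]
    return sorted(coarse, key=lambda d: finer_score(query, d), reverse=True)[:k_fine]


def build_context(query, corpus, k_coarse=3, k_fine=2):
    """Only the final survivors are placed in the (expensive) context of the big model."""
    selected = cascade(query, corpus, k_coarse, k_fine)
    return "\n".join(selected) + "\n\nQuestion: " + query
```

The point of the structure is cost: the expensive scorer only ever sees `k_coarse` documents, and the big model only ever sees `k_fine`, regardless of corpus size. The numbers from the interview (30,000 out of 10 billion, then 117) correspond to `k_coarse` and `k_fine` at a vastly larger scale.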
So it has a pretty high bar, but I do think there is traction in finding things that are you know much lower cost, whether it’s you know reducing algorithmic factors or very large constant factors on the the base N squared algorithm. I think all of these are pretty exciting. You can actually combine many of these these approaches. Um and and get you know much cheaper attention over many more tokens. Yeah, I think that’s one of the most important things because if it was cheaper in some sense and and and you could still find the the needles in the in the haystack over very long context, then you could you could have some sort of lifetime AI thing. Yeah, totally. Like I’d like my whole life of all the digital things I’ve seen in there as a say internal Google developer, I’d love for the entire Google code base to be in there which is you know probably 10 billion lines of code. Probably you know 100 you know 100 billion tokens. I want my wine list in there, what I like. All I want is a 100 billion tokens of attention. It’s all I need. Amazing. I think we got to do this one. So Google’s data centers run an enormous number of machines and at that scale anything that can go wrong will go wrong. Like I hear that wires wear down, hard drives fall apart, motherboards overheat. Um is that something that actually happens day by day and do you have any good stories? Absolutely. I mean I don’t have that many personal stories, but there used to be a chat group internally called data centers on fire that would have like exciting [laughter] exciting events happening and sometimes exciting videos. Um yeah, I mean I think at scale lots of things that are very very unexpected happen and usually those are the combination of one thing fails and something else fails simultaneously or in during the Yeah, you have a cascaded failure of some sort. You know, sometimes that means some software system stops working. 
Sometimes it means like the the bus bar overheats and you get too much power to the to the rack and like it catches on fire. I mean, that’s a much rarer thing, but um you know, you have to be prepared for this and I think one of the things even from the very earliest days of Google is we’ve really focused on how do you build reliable systems out of unreliable parts. Yes. Right? Like in the earliest Google days, we were buying consumer machines without uh ECC memory. Didn’t not not only not ECC, not even parity. Mhm. Uh we were buying consumer motherboards that didn’t have like redundant power supplies and you can do that if you can handle things at a higher level and that’s generally what we try to do in all cases is I actually wanted to ask you about that the ECC thing because here here’s one of my favorite failure modes. If if that’s true, but you you tell me, a distant supernova goes off, a cosmic ray hits a memory cell and a zero flips to a one. Does that really happen? Oh, yeah. Yeah, absolutely. I mean, alpha particles definitely can flip uh you know, DRAM state. We’ve actually observed this cuz we have monitoring data of how many ECC uh errors and like single-bit errors that are corrected and two-bit errors that are not corrected are happening in all of our machines. And you can actually see this where some clusters that are pointing in a particular direction on the Earth have a much higher rate for a you know, a brief period like 10-minute period or something and then the other ones on the other side of the Earth do not have that. So, it’s definitely something that happens. How worried should I be because MacBook Pros don’t have ECC memory as far as I know? Like for for one machine, is it so vanishingly, you know, unlikely that you shouldn’t care but for data center or I mean, for one machine, it’s generally not too bad. I mean, I think they have parity, so at least they detect it typically if it’s a single-bit error. So, detection but not fixing. 
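The "detection but not fixing" behavior of parity mentioned above can be shown with a single extra bit per stored word. This is a minimal sketch of even parity in general, not of any particular memory controller, and the helper names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def store(word: int):
    """Store the word together with its parity bit."""
    return (word, parity_bit(word))


def check(stored) -> bool:
    """True if the stored word still matches its parity bit."""
    word, parity = stored
    return parity_bit(word) == parity


def flip(stored, bit: int):
    """Simulate a bit flip (for example from a particle strike) in the data word."""
    word, parity = stored
    return (word ^ (1 << bit), parity)
```

A single flipped bit makes `check` return `False`, but nothing in the parity bit says which bit flipped, so the word cannot be repaired. Two flips cancel out and `check` returns `True` again, which is why parity alone misses double-bit errors.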
Right, but ECC usually gives you single-bit error correction and double-bit error uh detection. Yeah. So, with that, you don’t have to worry about it too much um at a single machine level, but even at, you know, tens of thousands of machines, you do have to start thinking about that. So, you know, one of the things we did when we were using machines without even parity is we built an entire software-based checksumming system for large amounts of our data. Doing it by hand? Doing it by hand, essentially. [laughter] And, like, you know, for crawling web pages and putting them in the index, Mhm. you know, if you detect that this particular record is corrupted, it’s generally okay to just, you know, ignore that record. Mhm. Now, I have something interesting for you. I call it lightning round. So, please try to answer in one sentence. One word is okay, one sentence. Make run-on sentences? [laughter] We’ll see. We’ll see. [clears throat] So, I read that Jeff Dean’s PIN code is the last four digits of pi. I give this one an eight out of 10. So, my question is: do you enjoy these Chuck Norris-style jokes about you?

It could be true. Um [laughter] uh I do enjoy them. I mean, it’s an April Fools’ joke gone awry by my colleagues in 2009, but it’s both very flattering and kind of embarrassing. [laughter] I think he felt the same way about them, too. He enjoyed them, too. Legend. All right, one big thing that you were wrong about and came around. I think AI is going to influence health care quite dramatically, but I think it is harder, not necessarily for technical reasons, but for, you know, how do you actually get... Regulations. And regulated industries that are super important and have all kinds of privacy constraints and safety concerns. Mhm. But I think ultimately that will happen. It’s just taking longer than I hoped. Yes. Cuz I think there’s tremendous world benefit to do it. Um, but we need to do it carefully and safely. Vim or Emacs or something else? Hint, there’s only one good answer. Emacs. [laughter] Was that it? Oh, no. Look, I’m a Vim person, but maybe I’m an embarrassment of a Vim person because I looked at Emacs too and I was like, that’s pretty cool too, but I don’t want to learn both. It’s just so much time. Yeah, it’s true. One can spend a lot of time customizing Emacs. Yeah, the .vimrc files I wrote up, and then it never ends. Yeah. One problem that you solved, tried to solve many times, but have never been able to crack. I mean, I think in some sense we still don’t have an answer to how do you do continual learning appropriately. That’s something I’ve thought about a little. I’ve dabbled a little bit with some techniques along with colleagues, but I think uh, you know, if we’re able to crack that, it’s going to be amazing, um, but it’s not there yet. Last one. Favorite Two Minute Papers episode. [laughter] Oh. Yeah, I mean, I assume the Transformer one was a good one. All right. All right. Well, that’s a good one. Okay, Jeff. I learned a lot today. Thank you so much. This was... Thank you so much.
Here you see me running the full DeepSeek AI model through Lambda GPU Cloud. 671 billion parameters running super fast and super reliably. This is insane. I love it and I use it on a regular basis. Lambda provides you with powerful NVIDIA GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers or click the link in the description.