Transcript: Mse435 Economics Of The Ai Supercycle Enterprise Internal Knowledge

Today we’re going to talk about this part of the stack: models, and how you build better models for better applications. Our guest for today is Yash Patel, founder and CEO of Applied Compute. I’m so excited to have Yash here, not only because he’s a recent Stanford grad (he’s going to talk a lot about his journey), but also because he was one of the very few undergrads who went directly to OpenAI research after Stanford, where he was part of the post-training team. He started Applied Compute after OpenAI because of an insight he had during his work there, and has built Applied Compute into one of the most successful businesses applying his learnings to incredible enterprises. Yash, thank you for doing it. Please join us.

[applause] Thank you. Awesome. That’s on. Cool. Thanks for having me. Thanks for joining us. Yash, tell us a little bit about yourself. You’ve had an incredible journey and you’ve made some tough choices. Actually, we were talking a lot with this class right before you joined about decisions about what to study. So, walk us through your journey that led you to today. Yeah, so I’ll talk a little bit about my journey. It’s not actually that long. I was actually sitting in here taking finals not that long ago. I’m class of ’25, so it hasn’t been that long since I’ve been on campus. I grew up in Austin, Texas, and came here for school. I like to say I was a very good student in high school and a bad student here. Not grades or anything, but I kind of never went to class, watched the online lectures, did that sort of stuff. The rapid-fire history is that I ended up building a bunch of stuff on campus and got connected to Sam Altman very serendipitously through some mutual friends. One thing about Sam that I think not many people know is that he has an incredible soft spot for helping young people early on in their careers. So, I ended up meeting Sam. We hit it off freshman year summer. A friend and I were deciding, “Hey, do we want to do our summer internships? Do we want to go and work on our projects?” We ended up saying, “Hey, let’s go work on our projects,” and reneged on our internships. We were looking for money, so we shot Sam a blind email. He gave us a very small check to cover food and rent and things like that. So, we worked on that for the summer, ended up shutting it down, and came back to school for my sophomore year. I was doing a lot of fun things here on campus like TreeHacks. Shout out to TreeHacks.
That was kind of my main thing here. I really loved putting on that hackathon for folks. But then late 2022 is when ChatGPT came out. I was just playing around with it, and I was like, “Holy crap, this is the coolest thing I’ve ever seen. I have to go work on it.” I couldn’t think about anything else. So, I ended up shooting Sam another email saying, “Hey, how do I come and work on this thing?” He put me in touch with what was called the OpenAI residency. I think it still exists. It’s actually how a lot of folks at OpenAI went from academic researchers or people in different industries to full-time employees at OpenAI. I joined OpenAI in early 2023 on the post-training team and worked with some people I really, really looked up to in the language model universe, starting on evals. A tip for anyone here: whenever you join a company, work on the hairiest thing that no one wants to work on, because people will like you for it. So, I worked on evals for the first year, and then the second year was when these reasoning models started coming out of the woodwork. People were training these reasoning models primarily on competitive math, and it was kind of this wow moment for everyone at the company, where we were seeing massive performance increases in using these models. So, a friend and I were like, “Hey, what if we actually try applying these models to things outside of competitive coding and math?” I wasn’t a competitive coding or math kid growing up. A lot of frontier research folks were. So, we ended up hacking together this agent that could browse the internet and write some code, and showed it to a bunch of leadership. They were really excited about it. So, we started this team called Long Horizon Tasks, and I was primarily focused on leading a lot of the agentic coding research, which eventually became Codex.
But yeah, I left to start Applied Compute about a year ago. Our one-year anniversary was actually last Saturday, so it hasn’t been too long. We saw this gap where these models were getting really, really smart, but when you actually went to apply them inside of the enterprise, they’re like smart geniuses that know nothing about your business. And inside of the enterprise is actually where you have most of the data in the world, right? All of these companies have tons and tons of proprietary data that they’ve built up over time. So, we’re helping companies take the same frontier technology that led to these smart reasoning models and create their own specialized models to enhance their business. Amazing. What a journey. Yeah, it was a lot of fun. I thought we’d start at the models, Yash. So, I went to the past century. Things have clearly escalated. And then if you zoom in to this side of it, post-AlexNet, things are moving pretty fast. Could you put this in context? What is going on in the model layer? Frame it for us. Why is the advancement in the last 4 years notable, and what is driving it? And I’ve got some of your slides if you want to scroll through them. Yeah. I thought we’d start with a bit of history. You mentioned AlexNet. Have you guys heard of deep learning or what that is? Of course. Okay, nice. AlexNet was, I’d say, the pivotal moment for deep learning. And it’s also kind of the moment that we stopped understanding what any of these models actually do. Essentially, deep learning is a piece of machine learning technology that allows you to learn underlying representations from data.
You basically train on a bunch of data, push in a bunch of compute, and you get out these really smart models that are made up of millions or billions of parameters. You actually don’t know what they do, but they are really good at tasks like prediction; next token prediction is what LLMs do. And the before and after was this: before AlexNet, people were creating handcrafted features, looking at pieces of underlying data and training various, call it rudimentary, classifiers to detect edges, things like that, on a lot of vision tasks. What AlexNet did is apply GPUs and a massive data set called ImageNet to neural nets, which led to the breakout moment where you actually proved that, hey, if you scale compute and data, you’re able to see these massive gains in the predictive accuracy of these models. Model development has a lot of different aspects to it. Maybe give us an overview. I know we’re going to go deep into a couple of aspects of it. What does modern-day model training look like? Yeah, so I think this is kind of a snapshot. This is by no means the most detailed timeline of events, but I wanted to pick out a few pivotal moments in recent years, starting with the transformer. I’m sure everyone here has heard of the transformer and self-attention. This was the moment where researchers at Google Brain came up with a new architecture that actually allowed scaling language model training. It was way more performant on existing hardware.
So you could actually run these workloads on GPUs, and compared to previous neural net architectures like recurrent neural nets or LSTMs, they were able to employ this technique called attention, which basically led to way better performance in next token prediction and could scale to these massively long sequences in language. Fast forwarding over the years, 2018 to 2019 was really this era of pre-training. People were taking these massive corpuses of text and teaching models to predict the next token by optimizing on loss. Basically, you have a model try to predict the next token, you see what the actual next token was in the corpus of text, and then you do backpropagation on the model weights to tweak the model so that it’s more likely to predict what that ground truth next token was. Then you entered the era of scaling laws, starting with the OpenAI scaling laws, which showed, hey, if you actually scale these models up and make them really, really big, you start to get much better performance. These were the Kaplan scaling laws, really proved out with GPT-3, which was the first model that seemed to have some level of general intelligence. So that was a breakthrough moment. And then you continue to compound on this with the Chinchilla scaling laws, which said, “Hey, not only do you need to make the model really, really big, there’s actually a compute-optimal way to scale these models. You make the parameter count much larger, but you should also train it on much more data.” And then once we started to have these models that were generally useful, it became a story of how do we actually make them useful to the normal person? How do we create these interfaces where it’s a tool that the everyday person can use?
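The next-token objective described above (predict, compare against the actual next token, nudge the model toward it) can be sketched in a few lines. This is a toy illustration, not how any lab trains: the three-word vocabulary, the function names, and the learning rate are assumptions made up for the example, and the update is applied directly to the output scores (logits) rather than backpropagated through real model weights.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy loss: negative log-probability of the true next token."""
    return -math.log(softmax(logits)[target_index])

def gradient_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent step on the logits (a stand-in for updating weights).

    For softmax plus cross-entropy, d(loss)/d(logit_i) = p_i - 1[i == target].
    """
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - learning_rate * g for x, g in zip(logits, grads)]

# Hypothetical setup: the corpus says the next token is "sat".
vocab = ["the", "cat", "sat"]
target = vocab.index("sat")
logits = [0.0, 0.0, 0.0]          # the untrained model is equally unsure of all three

loss_before = next_token_loss(logits, target)
logits = gradient_step(logits, target)
loss_after = next_token_loss(logits, target)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

After one step the loss on the true token goes down, which is the whole pre-training loop in miniature: repeat this over trillions of tokens with a transformer producing the logits.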
And this was the era of reinforcement learning from human feedback, preference tuning, being able to steer these models, because general base models are just doing next token prediction, right? So they hallucinate a ton, they don’t actually answer your question, and they may say things that are unaligned or not up to safety standards. And then GPT-4 was this next-level step change in the quality of these models. I’ll go through these very quickly. I think this is what people are probably most familiar with from the past couple of years, which is the era of reasoning models. In 2024, OpenAI came out with this model called o1, which introduced a new axis for scaling model intelligence: test-time compute. And I want people to know that chain of thought is a completely emergent behavior. When the model reasons, whenever it answers your question, it’s spending time thinking and correcting itself, and no one trained it to do that. Basically, by putting it in these constrained RL environments and then funneling a ton of compute towards it, you actually got these models that had this emergent property of being able to reason. And then, combining that with tool use, which a lot of you will know if you use Claude Code, Codex, or deep research, you actually started to get these agents that could reason, work for really long periods of time, and become what people are calling today AI co-workers. Fascinating. Do you have another slide? Okay, great. This is perfect. So, we have understood there to be multiple ingredients that go into making a great model: data, compute, talent, algorithms, maybe other things that I haven’t listed. What is the bottleneck today?
What has been the bottleneck in the past, and what do you suspect the bottleneck will be X years from now? Maybe a tour of history and a prediction for the future. Yeah, so running through what we just talked about, the bottlenecks went from having the compute to train these models, to the correct architecture, to actually being able to scale to the pre-training levels of data that we need, like training on the entire internet, to being able to make these models usable by preference tuning them. And today the bottleneck is, I’m sure people have heard of it, RL environments, which is actually how these very recent families of models have gotten so good at reasoning and intelligent thinking. I think the bottleneck for the future is basically this idea of continual learning, which I know we’re going to talk about a bit later. So far, we’ve gotten more and more data-efficient methods of training. Pre-training is not super data-efficient: you have to train on the whole internet to get the base model. People here, to learn something, don’t need to go and read internet-scale data. So, that was super data-inefficient. Then we started to go to these methods in post-training, which were a little bit more data-efficient, the most data-efficient today being RL environments. But the thing that is the holy grail, and I think what people think of when they think of ASI or AGI, is how can a model go and do something once and learn from extremely sparse reward? Just like if you burn your hand on the stove, you just need to do that once, and then you know the stove is hot and not to put your hand on it. These models today are not really like that.
So, I think this idea of continual learning, and being able to be extremely data-efficient with real-world interactions, is the next bottleneck. Got you. Super helpful. Well, when you burn your hand, it’s also very loud feedback. Yeah, exactly. So, hopefully continual learning is giving us loud feedback. And the other big question, maybe not on the slide, is why have all the labs converged on software engineering? Why have they focused on code as the first frontier? I’m sure there’ll be other frontiers like life sciences or cybersecurity or others that I don’t know about, but why software engineering? What is the unique property of code that people find so interesting? So, the type of RL training that these labs are doing is reinforcement learning with verifiable rewards. In order to actually get the learning signal, or the reward signal, you need to have a deterministic way to check if what your model did was the correct thing. Mhm. Code and math are really, really good for this, because what can you do? You can compile the code, you can run unit tests against it. You can actually check to see if the code is doing the right thing. Mhm. So, that is one reason why code has been super valuable. The other is that it’s really easy to make a lot of synthetic data on this. Right. Scale of data. Yeah. The scale of data, and the prior is really good: there’s a ton of code tokens on the internet. And then I think the other thing is that a lot of researchers, me included, think coding models are kind of AGI-complete, in the sense that every task, when you boil it down, is a coding task, right? That’s why you see Claude and a lot of these other models writing code instead of doing tool calls or things that are more specific to the task.
They’re actually just using code as a general language to interact with the real world. Mhm. One of the questions we were discussing right before you came is how do we get good at jobs with AI that are not code or code adjacent. To give you an example: making slides. Mhm. Like this one, by the way, completely generated with Claude Code, with an initial set of conversations that we had. Yeah. What is the relationship between code and slide generation that would make these models good at generating slides because they got good at code? So, did you make these slides with Claude Code? Of the formatting. Oh, great. Okay, yeah. So basically, I’ve also made slides with Claude Code, and it’s able to make this table. It’s able to set the formatting on the title. It’s able to put that random blue line there. Exactly. But what you can do is combine the outputs of this model with other auxiliary rewards to tell it how good the code it wrote was. Here, this is extremely functional, but if we wanted to really optimize for aesthetics, we could combine not only the code execution, actually being able to make the slides and make them structurally relevant, but also some sort of reward model that can look at the output of the slides and has been trained on human preferences of what aesthetically pleasing slides look like and what ugly slides look like. And you can combine those rewards and jointly optimize for both writing the functional slides and also making them look pretty. Right. Well, this is a great segue into talking about your work. Tell us a little bit about pre-training and post-training. What do those words mean? Yeah. And I know you are the world’s best expert on one of those, so we’ll dig into that.
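As a rough illustration of the slide example above, combining a functional check with a learned aesthetic score, here is a minimal sketch. Everything in it is an assumption for illustration: the function name, the 0-to-1 score range, the 0.3 weight, and the choice to give zero reward when the slide code fails to run. The discussion only says the two rewards are combined and jointly optimized, not how.

```python
def combined_reward(functional_ok: bool,
                    aesthetic_score: float,
                    aesthetic_weight: float = 0.3) -> float:
    """Blend a verifiable check with a learned preference score.

    functional_ok:    did the generated slide code execute and produce slides?
    aesthetic_score:  hypothetical reward-model output in [0, 1], trained on
                      human preferences for what good-looking slides are.
    aesthetic_weight: assumed share of the reward that goes to looks.
    """
    if not 0.0 <= aesthetic_score <= 1.0:
        raise ValueError("aesthetic_score must be in [0, 1]")
    if not 0.0 <= aesthetic_weight <= 1.0:
        raise ValueError("aesthetic_weight must be in [0, 1]")
    if not functional_ok:
        # Assumed design choice: pretty-but-broken slides earn nothing.
        return 0.0
    return (1.0 - aesthetic_weight) + aesthetic_weight * aesthetic_score

# Three hypothetical attempts at the same slide.
attempts = {
    "broken code, pretty mockup": combined_reward(False, 0.9),
    "works, looks plain":         combined_reward(True, 0.2),
    "works, looks polished":      combined_reward(True, 0.9),
}
for name, reward in attempts.items():
    print(f"{name}: {reward:.2f}")
```

Gating on the functional check keeps the optimizer from trading correctness for looks; a plain weighted sum would be another reasonable choice.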
That’s very flattering. But yeah, in language modeling, I think the two big buckets are pre-training and post-training. Pre-training is this massive training effort where you take internet-scale data, trillions and trillions of tokens, you throw a ton of compute at it through this architecture, the transformer, and you train a neural net to get really good at learning patterns in language. Mhm. And the thing that falls out of this is some form of intelligence. Essentially, pre-training is this idea of compression, where you’re able to take all of human knowledge, i.e. the internet, and put it into a set of weights that actually understands the patterns in language, how to think about things, and whatnot. The problem is that once you have this pre-trained model, which by the way takes orders of magnitude more compute than post-training, you actually need to go and align this thing. It’s just next token prediction. So, if you write a sentence like, “Who should I invite to dinner?”, it starts to basically say a bunch of random names or something like that. That makes no sense, because really the model should be like, “Oh, I have no idea who’s on your invite list or who you know. Please tell me who these people are.” So, post-training is the process of taking this model and telling it what good and bad outputs look like. And you actually get a model that learns a chat format, where there’s a user message and an assistant message that responds to you. It learns safety guidelines.
So, if a user asks, “How do I make a weapon or a bomb or something like that?”, you can actually tell the model, “Hey, don’t tell these people how to go and make these harmful weapons.” And in both of these cases, what’s scarce is data. I have this slide here, which is very dense. You guys probably don’t need to look at it too closely, but essentially what pre-training does is just optimize for loss, right? So, how do you get really good at predicting that next token? But what happened is, we talked about these Chinchilla scaling laws: you’re scaling model size and you’re scaling the data that you’re training these models on. We’ve just run out of data. There’s only so much data on the internet, and, I know you have the slide later talking about this, but we have approached the frontier on what data is available to these labs. And the labs are the only ones that can do this level of training, because it requires so much compute and so much data. It’s a huge capex requirement. And then on the post-training side, you have all these different methods for training models, from supervised approaches like SFT, to preference tuning with RLHF, to RLVR, which we’ll talk a bit more about. [snorts] There’s a really good article actually in the readings. If you guys read Karpathy’s write-up on RLVR in the readings, that talks about what happened in 2025. RLVR really came to prominence in 2025. And we’ll get into that in a second, but actually, if you just click one slide forward. Oh, actually here, can I? Yeah, yeah. I had a question on data for you. So, yeah. We spoke about this. Data. Yeah.
Running out of data. This is the point you were making. Mhm. We had Ali Goetz here 2 weeks ago, and he spoke about how most of the data beyond this frontier is actually going to be AI generated. Mhm. A lot of tokens in the world will be just AI generated, given the volume at which they’re being generated. Yeah. Talk about that for a second. What is the frontier of data? Where do we get more data from here on? The model that will be trained in, let’s say, 2030: what is the input to that, and where do we get it? Proprietary, public? Yeah, so I think there are multiple layers to this question. The first is... And there’s a whole economy, sorry to interrupt you. There’s a whole economy of these companies. Totally. Like scale and record and others, whose full-time job is to maybe create... Maybe touch on that as well. Yeah, exactly. So I think this is a visualization of pre-training data, which is really just about scale. You have people who are starting to buy old libraries with ancient books in them and scanning these books to get more tokens. You have a lot of investment in synthetic generation: how can you take primary source documents and explode them to multiple orders of magnitude more tokens and see if you can learn more from that? So I think in pre-training, that is going to be the methodology. Really what people are focusing on now, I think, in pre-training is new architectural advances to make better use of the data that we have today, because in principle you shouldn’t need internet-scale data to learn a lot of this stuff. So people are trying to ask, “Okay, how do we actually use the data better?” There are all these data wall challenges that the frontier labs are working on. Now, what you’re talking about is RL environments.
And so this is a different type of data. This is, “Hey, let’s actually construct the world that the model operates in, have it go and do a bunch of things, and then exchange compute for less high-quality data.” So, basically, we can use way, way more compute and learn a lot more from a single sample or rollout. We were talking about code a little bit, and we can talk about this on the RLVR slide. When you make a code environment, it’s not like pre-training, where you’re training on a codebase and just learning all the tokens in the codebase. You’re saying, “Hey, I want you to go and implement this feature.” You actually have the model try it hundreds or thousands of times. You have a way of checking if the model actually did the correct thing, right? That’s the verifiable reward. And you get this distribution of rewards, because sometimes the model will do it correctly and sometimes it will do it wrong, and you’re actually able to learn way more from that type of training than you are from pre-training alone, which is just next token prediction. Fascinating. Yeah. And maybe the same question for evals. I think you had something on evals, but talk to us about evals. Why are evals important? Why do labs guard their evals? There’s a lot of chatter about this being the most protected asset. Mhm. Tell us all about it, and you referenced this being the hairy job to do. Why is that the case? So, I think as we start to train these models on, essentially, reward functions, the most important thing becomes actually knowing what good and bad look like. Evals are a way of benchmarking your model and, given a certain task, understanding how the model acts.
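The code-environment loop described above (attempt the task many times, check each attempt with a deterministic test, look at the spread of rewards) can be sketched as follows. This is a toy under stated assumptions: the four hand-written candidate functions stand in for model rollouts, the task ("add two numbers") and test cases are invented, and subtracting the mean reward as a baseline is a common policy-gradient convention that is assumed here rather than taken from the discussion.

```python
def verify(candidate) -> float:
    """Verifiable reward: 1.0 if the candidate passes every unit test, else 0.0."""
    test_cases = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]
    try:
        passed = all(candidate(*args) == expected for args, expected in test_cases)
    except Exception:
        passed = False          # code that crashes earns no reward
    return 1.0 if passed else 0.0

# Hypothetical rollouts: four attempts at "implement add(a, b)".
rollouts = [
    lambda a, b: a + b,         # correct
    lambda a, b: a - b,         # wrong operator
    lambda a, b: a * b,         # wrong operator
    lambda a, b: b + a,         # correct, written differently
]

rewards = [verify(attempt) for attempt in rollouts]
mean_reward = sum(rewards) / len(rewards)

# Assumed baseline: attempts that beat the average get a positive signal,
# attempts below it get a negative one.
advantages = [r - mean_reward for r in rewards]

print("rewards:   ", rewards)
print("pass rate: ", mean_reward)
print("advantages:", advantages)
```

The learning signal comes from the contrast between attempts on the same task, which is why one prompt with many scored rollouts can teach more than one pass over the same tokens in pre-training.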
The reason why evals are so important to the labs is that evals set the roadmap. So, if we want to go and train a really, really good code model: SWE-bench, I think, was the eval that started the whole code model race. And that was because people had something they could optimize towards in terms of, what does useful coding look like? Now, SWE-bench I think is a very flawed eval, and there have been a lot of new evals that have come out since that are much better. But that’s the whole point: whatever hill you want to climb, you first define it with an eval, and then RL is this eval-maxing machine. You go and create a training pipeline that looks very much like your eval, obviously with different data, because you don’t want to overfit directly to those eval data points, and then you just climb that hill. And then it’s on to the next eval. Mhm. This is also particularly important when it comes to enterprises. Enterprises internally have their own idea of what good and bad look like, right? Good and bad are not the same across, say, a JP Morgan and a Goldman Sachs. They have different standards, they have different ways of operating. So they will have their own evals. And you get this tiered effect where there are the evals that the model labs optimize towards, and then there are the evals that the enterprises optimize towards. And Applied Compute is that specialization layer, helping enterprises optimize to their specific evals. Fascinating. Good segue into Applied Compute. Yeah. So, what led you to start Applied Compute? Why was it better to do it as an independent business than inside of OpenAI, where you were before Applied Compute? And what do you guys do?
Yeah, so like I mentioned, we started Applied Compute a little over a year ago. I started it with my co-founders, Rhythm and Lyndon, who were actually both students here at Stanford. We were also all at OpenAI together. A funny story is that when I joined, Sam was basically like, “Who’s the smartest person you know?” That was Rhythm. A couple of months later he asked Rhythm the same thing, that was Lyndon, and that’s how we all ended up there. But we really started Applied Compute based on this core idea that the future is very specific to enterprises: you’ll have these general models, which are workhorse models, but specializing them towards individual enterprises’ needs is actually going to be how people differentiate. General models set the floor, but in order to set the ceiling, you need to go and build, train models, and create these specialized systems in order to differentiate yourself from all your competitors. So, an [clears throat] example here: DoorDash is a customer of ours. I’m not sure, a ton of people order all the time. I’m very guilty of ordering way more than I should. It’s all a part of the RL environment, right? I know, exactly. We were just testing the product. So, one of the tasks that we worked on with DoorDash, and I’m picking this one because it’s very practical and shows what we do: DoorDash onboards 100,000-plus merchants every year to their platform. Mhm. And when these merchants come to the platform, they basically supply a bunch of unstructured information about their business, including menus.
And menu extraction, actually being able to go from images like this to a DoorDash storefront, is a really, really hard task. I see you guys doing that. Exactly. DoorDash has this very specific style guide for how modifiers are supposed to be attached on top of items, what you can [clears throat] mix and match, what’s an add-on versus a special ingredient, things like that. And when we tried using the general models on this, they just weren’t able to do that task. We tried prompting and all this sort of stuff. What actually ended up being the solution is that you could take outputs of our model, have humans go and correct those menus, and understand the delta. Then, during training, we could have a model’s output be checked against the ground truth, and essentially we had a way of quantifying the loss, or the reward, essentially the error rate, and we were able to optimize directly against reducing error rate. So this is a very clear example of how a company just needs to go and define what good and bad look like, and you actually don’t need to do prompting or any of this sort of stuff. You can directly optimize towards the outcomes that you want. Fascinating. Can I ask you a follow-up on this? Yeah. So, in the prior gen, let’s say before transformers, this would be a problem that an OCR model, a vision model of some kind, would have been applied to. Yeah. Is the part where you guys come in optimizing this specific problem, and is this not using a vision model but using a transformer model? We’re using a VLM. Yeah. So it’s a vision model with a transformer architecture. Yeah. Fascinating. Why would DoorDash, and sorry to ask you a hard question on the spot, which is off curriculum? No, no, no.
Why would you guys specialize an existing model when maybe there’s a chance that, I’m making it up, GPT-17 might be much better out of the box? Mhm. Are you incentivized to just wait for the next series of models, or should you work with Applied Compute? Yeah. No, it’s a great question. I’m glad you asked. I think what people often don’t realize is that enterprises care about being at the frontier at any point in time. GPT-17 is going to be quite a long time from here. I think by the time we have ASI, or this model that kind of controls everything, which actually I don’t believe is going to happen. I think the world is just a very fragmented place, and if you just look at where the data is, it’s dispersed. But yeah, the time to value, the ROI on being able to train your own models today with way less compute: RL has become very, very data-efficient. You need to use an order of magnitude less compute than pre-training or these other types of training, and you’re able to optimize performance way more than you were with SFT and RLHF. That makes it a lot more appealing to be able to train these models. The order of magnitude of investment that goes into pre-training versus post-training: can you give us an estimate for that? Let’s say it was a $100 budget for training. How much of that would be pre-training, and how much would be post-training? Yeah. So, I think I looked it up on the way here. DeepSeek V3 was trained on about 2.4 to 2.5 million H800 hours. The RL training, so the training that led to DeepSeek R1, was about 150k. So, it comes out to about 5% of the training compute that’s needed for pre-training. But... That’s it? That’s it.
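To make the $100 question concrete, here is the arithmetic on the figures quoted above. The GPU-hour numbers are the speaker’s approximate recollection (2.5 million is taken as the upper end of the quoted range) and are not independently verified here; the function names are just for this sketch, and it assumes cost scales with GPU-hours.

```python
PRETRAIN_GPU_HOURS = 2_500_000   # quoted for DeepSeek V3 pre-training (approximate)
RL_GPU_HOURS = 150_000           # quoted for the RL run behind DeepSeek R1 (approximate)

def rl_share_of_pretraining(pretrain_hours: float, rl_hours: float) -> float:
    """RL compute expressed as a fraction of pre-training compute."""
    return rl_hours / pretrain_hours

def split_budget(total_dollars: float, pretrain_hours: float, rl_hours: float):
    """Split a budget between pre-training and RL in proportion to GPU-hours."""
    total_hours = pretrain_hours + rl_hours
    pretrain_dollars = total_dollars * pretrain_hours / total_hours
    return pretrain_dollars, total_dollars - pretrain_dollars

ratio = rl_share_of_pretraining(PRETRAIN_GPU_HOURS, RL_GPU_HOURS)
pretrain_dollars, rl_dollars = split_budget(100.0, PRETRAIN_GPU_HOURS, RL_GPU_HOURS)

print(f"RL compute as a share of pre-training: {ratio:.1%}")
print(f"Of a $100 budget: ${pretrain_dollars:.2f} pre-training, ${rl_dollars:.2f} RL")
```

On these numbers RL is about 6% of pre-training compute, or roughly $5.66 of a $100 budget, which is in line with the "about 5%" quoted.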
But I think what’s really interesting is that trend is starting to change, where people are pre-training these models, but then they’re also doing data center-wide, multi-data center-wide RL runs, because you have these scaling laws. I think I had a photo of Jensen and the three scaling laws. There’s pre-training scaling, there’s post-training scaling, and there’s test-time scaling, inference. But with post-training scaling you can actually massively increase the batch size of each of these training steps and you get a lot better performance. You can have these models do a lot more reasoning when they’re attempting these tasks. So I think the trend you’re actually seeing is that the compute spent on RL is increasing quite a lot. Got it. So it’s 5% today, but you expect it to go up as... Well, it is going up. Yeah, yeah, yeah. As a relative percent of the total training budget. Yeah, so, you know, I don’t know the latest stats on like a Meetha or like a 55, but up until basically when we were scaling out o1, o3, Codex, deep research, we basically saw that the more compute you put into RL, the better performance you get. So these things stack. Makes sense. Yeah. Maybe a couple of other examples. I found that to be very, very helpful. Are there other examples of domains, outside of for example converting this menu to DoorDash, that you could share with us of where this is a particularly useful application? Yeah, yeah. So, I mean, we were talking about coding before. We recently just put a model in production with Cognition / Windsurf. Basically the idea is, when you’re writing code and you save your file, how cool would it be if you had a model that ran in under two seconds, checked what code you wrote, and then told you if there was a bug in it or not.
So, you know, this is not something you can get with a general model, because there’s this Pareto frontier of performance, cost, and latency. So if you take a really small model and you post-train the heck out of it on getting really good at this task of bug catching, you’re able to get the benefits of cost and latency and the performance of some of these larger models. Got it. And so the value add for Cognition and Windsurf is that they’re extending their product suite from just writing code to now also doing testing and bug... Exactly. Yeah, and I think this gets into this really interesting idea of model, harness, context co-development, which is that you never really can focus on just one layer. A lot of these application layer companies are doing a ton of innovation on the harness, and that’s actually how they’re able to squeeze value out of these models, especially when it relates to the service that they’re providing. And then context is just, oftentimes if you don’t have access to the right data, you won’t know the right thing to do. So being able to plug into all of these different data sources inside of companies, that’s also extremely important. Fascinating. So in the case of Cognition, it might be a true competitive advantage for them to expand their product frontier. Exactly. Yeah, yeah. I mean, I think you’re seeing people start to push the frontier on what these models can do. And it’s usually an ensemble of models, right? General models are extremely powerful, really good orchestrators, but fast sub-agents, or agents trained on proprietary data that’s out of distribution for these models, something like this, those can be orchestrated with the general models to create a really powerful system.
So, I think today actually, Ramp Labs, you know, the corporate card company, we’re actually really good friends with them and know a lot of folks there. They trained this RL model to basically do fast search inside of your spreadsheets. And that’s a way you can actually go and improve the product experience. Fascinating. Yeah. We’ll change topics, Yash, and talk about a bunch of these emerging model training techniques that we’ve been hearing about, and maybe you can decompose those for us. Love to. Yeah. We’ll start with continual learning. Yeah. What is it? A lot of smart people we know talk about that as the next frontier, including you. Yeah. Break it down for us. Yeah, so continual learning is really kind of what I mentioned before, which is: how do you learn from extremely sparse rewards, right? So if you have a system deployed in production, how are you actually able to understand how that AI model is being used, understand the downstream consequences of its actions, and then use that to update the system so it gets better over time? So what I have here is kind of two examples of... The red button. Yeah, two examples of what I think this is starting to look like, and to be clear, I think this is going to be a very gradual thing. A lot of continual learning is blocked on just having access to the right data. So when you go and deploy an agent in production, are you putting it in front of the right people to get feedback? Are you actually deeply understanding all of the context necessary to know what good and bad looks like? That’s just a data access problem.
So I think this is going to look like kind of a slow, gradual rollout rather than like, “Oh, there’s some extremely valuable insight that someone comes up with.” But a couple of examples of how you’re seeing this today. Cursor, they have this model called Composer, which is essentially their own coding model, trained on their coding data on top of an open-source model. And what they did was really cool. They basically took this model, had people use it in production, were able to capture a bunch of that telemetry, and take steps online, so take a training step based off of some implicit rewards that they calculate. So basically they would look at: did the user accept this code suggestion, or did they revert this code, or did they... For success. Exactly, yeah. And they would kind of optimize towards it, and they were able to see improvement by doing this sort of online training where you’re collecting data, taking a train step, collecting data, taking a train step. Then, this is something that we’ve been doing. Actually, just to follow up on Cursor, the x-axis you have here is steps. Would you convert that for us into time? Order of magnitude, how much time did Cursor have to invest in improving Cursor’s performance? Yeah. For Cursor, is it days, weeks, or hours? It’s a good question. I think we’re talking about days or weeks here. And then I think a couple of hours per step. Got it. I don’t know the exact terms, but one interesting thing here is that in RLVR, when you’re training offline, you have this replayable environment where it’s the same task, it has a defined reward, and then you’re rolling out that sample hundreds or thousands of times in parallel. You can’t do that in production, right?
Because people are using this, you know, they have dynamic environments or whatnot. So what they really experimented with was: could we just take a massive batch, so many, many conversations, denoise the gradient that way, and then take a step, and hopefully that’ll be directionally the way to improve the model. So, yeah, I think this is like hours per step, and then each of those steps is quite big, with a lot of samples in them. Mhm. On the right here is something we actually have been working on, which is this context-based idea, which is: can you actually go and use agents, expend compute offline to go and analyze a bunch of documents, analyze a bunch of past traces that humans have had with agents, and extract learnings from that that’ll improve performance downstream? So one thing that we were able to see, at different reasoning efforts, is a massive increase in performance while using the same amount of tokens. So, yeah, these are just two examples, but I think, high level, you’re going to see innovations at weight updates, context, and the harness itself to actually be able to capture this information. Fascinating. Yeah. More to come on this. Yeah. Second topic is non-transformer models. Yeah. There’s a lot of talk in the class about Transformers not being a very efficient architecture, taking up a lot of power. We had Ali Ghodsi, and he said that, well, look, flying the way airplanes do is far less efficient than the way birds do. Turns out we’re a heavy species, heavier than the birds.
Is the Transformer like that, in that while not efficient it will be the dominant way because the world’s infrastructure has morphed and moved along that way, or do you think there’s a shot for a non-Transformer architecture, like the Mamba architecture or another, to be a dominant player in AI models going forward? Yeah, I think my honest take is that scaling Transformers is working, and as you know, there’s a very simple recipe to be able to make these things smarter and better, and it’s probably more likely that the AI will tell us what the better architecture is if we just continue scaling it up than that we come up with one ourselves. If there was kind of a wall in terms of what this architecture would allow us to do, then I think we would see some innovation there, and there certainly is really cool research happening, but my opinion is just: concentrate on scaling Transformers. Very smart people are on the other side of this debate, as you know, Ilya, Yann LeCun, others going at it. If you can share, do you know what the core insight is that leads them to believe the other side of the debate and disagree with you? Yeah, so I think the core insight is, kind of like what I said before, you don’t need pre-training levels of data to be able to actually learn the underlying representations of language. I think Yann LeCun talks about this a lot, where it’s like, humans don’t need that, therefore the architecture that we develop shouldn’t require that. So I think that’s the underlying argument, just first principles: you shouldn’t need it, therefore there must be a better solution out there. But I think, kind of to your point, the investments we’re making in our compute scale-outs.
You know, I think there’s people who are actually optimizing for the architecture directly in the chips. Mhm. That is kind of, I think, a big ship to turn. And so far, what we’ve seen from the labs, which are going to control, I think, a lot of this build-out, is just investing more in the transformer architecture. They’re definitely doing research on new stuff, but I think it’s all experimental and, you know... Yeah. Could work. Could work. What a time to be in the... Big, large sums of money are going into these techniques. We’ll see how it shakes out. Yeah. Rapid fire, last 5 minutes or so. You know, a lot of folks in the class are deciding where to build, what part of the stack to build, what to start. If you were not building Applied Compute today, Yeah. what would you be doing? What’s your next best idea? Yeah, so I’ve been thinking a lot about this, because I think one thing that we’ve been running into at Applied Compute, and I know a lot of other AI companies are running into, is scarcity of compute. Mhm. I think the demand is just far outpacing the supply for compute. And I think there’s going to be massive innovation in the energy sources needed to power this compute, and then also in making more efficient chips themselves. Mhm. So, you know, not that I have a background in hardware or chip design or anything like that, but I think we could be making way better hardware to optimize the co-development of training and chip design. So I would probably look into hardware. Okay. Thank you. The long-short game: pick a business, a product, a person that you like a lot, that you’re excited about. Yeah. And the counterfactual, something that is more hype than reality. Yeah.
Yeah, so I think, you know, it kind of goes along with my last answer: compute and the chip providers like Nvidia. I’m very long them. I think they’re going to continue to win. They’re going to continue to supply all of the labs. And actually it is interesting, though. Once you look at the compute economics of Nvidia, you know, they take like a 75% margin on top of their chips. You have these labs spending hundreds of billions of dollars. The question is like, hey, maybe we take a couple hundred billion dollars and invest that in our own chip design, then we can do all the co-development of model training, architecture, and chips internally, and maybe our chips are like 80% as effective, but we just make 1.2x, or whatever it is, more. So I think generally chips, and Nvidia is going to be the leader here. I think that’s valuable, but I do think there’s also a risk that the labs, who are the biggest customers, are maybe just going to in-house this stuff themselves. But it’s very hard. One thing I’m a little less bullish on, I think, is the data market, which is just really tough. So if you think about an RL task, you have basically models that are not very smart. You train them to be very smart at particular tasks, and then it becomes that much harder to go and create new tasks that you can actually hill climb on, right? So when we were... Because the problem’s much harder. Exactly. You’re sort of getting squeezed, where it’s like, okay, yeah, I’m selling to my customer and I’m improving their model, but I’m also making it harder for myself, because the next time they come to me and say, hey, I want you to go and build this task, I have to spend way more money.
It’s going to take a lot longer, all that sort of stuff. And then I also think the models are just getting really, really good. So I think you’re starting to see a lot of synthetic data generation, because if you think about an RL task, a lot of times what it is is exploiting some sort of generator-verifier gap. For code, right? You hold out the unit tests, you have the model attempt the task, and then you run the tests against the model’s output. That’s not something that you really need a human to do. The smarter your models get, the better pipelines you can build around synthetic data. So, I think data has been a thing where people have been like, “Oh, it’s going to die,” every couple of years, and it hasn’t. But I do think it’s going to change, and the best founders in the data market are just really good at pivoting, and sort of asking, you know, what’s the next wave? Robotics data, egocentric data, like put on a GoPro and collect a bunch of this stuff. So I think RL environments today are going to be tough, but, you know, they’ll go on to the next wave of things. Awesome. Final question. Your favorite AI product or favorite AI modality. Yeah. Oh, this is a good one. I love GPT, or sorry, image GPT-2. Image dual. Image dual. Yeah, you know, when it came out, I was having a ton of fun with it. I think it’s massive for people who can’t do design, like me, or who want to look at things visually. A lot of times I’ll just take some paper or something and drop it into Image Dual, and it gives me this nice visual and sort of walk-through of how things work. So that’s probably been my favorite AI product. An artifact: the most beautiful slide in this presentation was an Image Dual representation when I fed it the course syllabus. It was like, “How about this?” This one. Oh, wow.
And the signature for those of you who can tell is this one right here. It’s the Image Dual watermark. Oh, really? Oh, interesting. Anyway, awesome. Well, thank you so much for being here. No, thank you. We’ll continue this conversation and see how these bets play out. Let’s do it. Awesome. Thanks for having me.