TITLE: World’s Top Researcher on AI, LLMs, and Robot Intelligence
CHANNEL: Invest Like The Best
DATE: 2026-03-31
---TRANSCRIPT---
My guest today is Sergey Levine, one of the co-founders and researchers at Physical Intelligence. As a disclaimer, I’m an investor in Physical Intelligence because I believe it’s one of the most important companies tackling the problem of robotics. As you’ll hear us discuss today, robotics has what I would call a scarecrow problem. All of these amazing physical devices are becoming ever more possible in all sorts of cool permutations, but what they all really need is an intelligence, a brain, and that is what they’re developing at Physical Intelligence. They are trying to develop foundation models that can make any physical robot do any task in any environment. That challenge is daunting and has required many of the world’s best researchers, with Sergey as one of those leaders, coming together to try to solve this problem. Our conversation today covers all of the problems facing robotics and all of the promise of solving these problems across the world. I hope you enjoy this great conversation with Sergey Levine.
This is going to be a real treat and a blast to learn about possibly the most exciting impactful area of technology being developed. Just to set the stage before we go back in time, maybe you could just define physical intelligence as you see it.
Fundamentally, the goal of Physical Intelligence is to develop robotic foundation models that can control basically any embodied system to do any task. But broadly speaking, you could imagine that in the same way that a language model is rapidly evolving towards a system that can do any task that can be expressed in language, what we would like is to build a new class of models that can do any task that can be done by a physical actuated device.
And part of the thesis of this company is that we believe that doing it at the full level of generality might actually in the long run be easier than trying to special case very specific narrow application domains. Again, in much the same way that for language models, it turned out to be easier in some ways to solve natural language tasks in their full generality than to narrowly target like machine translation or sentiment analysis or whatever.
That may not be obvious why you would make that bet versus a robot that just does your dishes or something. So what are the key trade-offs to understand and why make the decision that you made?
Maybe I can give you a two-part answer for this. The first part is how it relates to the analogy with language models, and the second is what that means in the robotics world.
So the first one is a little more informed by evidence. In the world of natural language, we saw a lot of efforts to develop domain-specific solutions that tackled specific problems. Somebody would spend a lot of time thinking about how English differs from French and then build a machine translation system. The reason that language models actually took over for all of those different application domains is that they can leverage much broader sources of data. And it’s not even as simple as saying, oh, we had this data for this application and that data for that application, so let’s merge everything. It’s actually more than that. When you can leverage weakly labeled data, which in the case of language models you just mine from the web, you actually learn more about the world. So you establish a foundation of world understanding, and on top of that foundation it turns out to be much more effective to build out different applications.
So to bring this into robotics: obviously the calculus doesn’t look quite the same, because in robotics we don’t have an internet-sized data set that we can just draw on. But this notion of understanding the world is, if anything, even more important in robotics, because if you have many different tasks, maybe even many different physical systems, then you can go from training individual dishwashing specialists or laundry-folding specialists to training a model that actually understands physical interaction. People can master new skills very, very rapidly because we understand physical interaction. We can intuitively grasp what’s going to happen in a new, unfamiliar situation, and that lets us bootstrap things really, really quickly. So if we can draw on data from many sources, many applications, many robots, then we can have a model that has a physical understanding, and then it’ll be much, much easier to put new applications on top of that platform.
What is the hardest part about building in this way? When you see other approaches that are more maybe legible to the average person — oh, there’s a robot moving around doing this one specific thing, it looks a certain way — what’s the hardest part about this approach?
I think this has actually been kind of an issue in my whole career, because when you work on robotic learning, and the more general the system the more this matters, effective generalization isn’t actually the optimal way to get a really exciting demo. The way to get a really exciting demo is to pick a really cool task, control everything else in the environment, set it up so that it’s perfectly clean, perfectly pristine, and just make it work in that one setting. Generalization you can’t just show in one spot. The point of generalization is that the robot does something relatively mundane that any human could do, but it does it in any situation.
So, we had some demos that we released last April where we showed our robot cleaning kitchens. And I think it’s kind of cool but if you watch an individual video out of context it’s just like okay it’s picking up plates like anybody can pick up plates except that we just put it into that home just for that demo and it never had training data from that setting. So obviously you kind of have to understand what’s going on to appreciate why this is actually pushing the frontier.
What is your model for the stakes of what you’re doing? Like if you are successful, I’m curious for you to define what that would mean.
One of the things that I think would be really, really exciting, and that would be enabled by a general-purpose embodied foundation model, is the ability to unlock people’s imagination in how they build robots and other embodied systems. Personal computers were a really big deal in my mind because they made it possible for lots of people to hack together all sorts of really cool stuff, and there was this Cambrian explosion of amazing applications that started in the ’90s and was further accelerated by the internet.
And I think something like that might happen in the world of robotics but it can’t happen today because if you want to put together some cool new robotics application some cool new robotics idea you kind of have to build this monstrous stack and you need to basically solve the intelligence problem.
But if there is a solution that someone can build on top of, a foundation model that you can prompt and that provides basic functionality, which you can then maybe fine-tune a little bit or adjust in some way to your application, then it actually becomes a lot more tractable for lots of people, lots of companies, lots of individuals to try out all sorts of different things.
And I think we sometimes think that robots are going to be like one thing — just like, you know, metal people. But I don’t think that’s how it’s going to be because no technology has been like that. It’s going to be more like a toolkit where you can put together all sorts of really cool applications. Get really creative with it. Maybe I’m going to make a robot with five arms and this one’s going to hang from the ceiling. But you need the right platform on top of which to do that. And I think the foundation model can be that thing.
What are the pros and cons of the humanoid approach to robotics?
One pro is that it’s really cool. You can show it to somebody and they get it. There’s a lot of value to capturing the imagination. But I think it’s one of many possible kinds of robots that we’re likely to have. And fundamentally the intelligence challenge looks very similar for all these different robots. I don’t think we should be tackling intelligence in the context of one specific body. I think we should handle it in a general way because otherwise it’s just really hard to get a handle on this. We need lots of data. The cool thing about being able to build robots is that ultimately they don’t have to be constrained to look like humans at all. You could imagine that you’re building a house with a robot that is a swarm of 10,000 quadcopters.
I think that in the future we’ll have a robotic foundation model which can then be adapted to all sorts of applications and it might really run the gamut from bulldozers to humanoids to robotic arms. And maybe it would need to be adapted to each one, maybe fine-tuned, maybe we would need something in context to understand how that body works. But the fundamentals of how you interact with objects, how things move in the world, how causality works — that’s all conserved for all of these different systems.
There are a few things worth thinking about. One is that we can make machines that are very big and machines that are very small. In the long run, there’s lots of really exciting applications in medicine, in surgery, where we might not only not be limited to robots that look like humans, we might not be limited to robots that can even be controlled by humans. Currently in robotic surgery it’s done entirely through teleoperation. But in the long run we could imagine addressing that.
If you think about the most important hash marks on the timeline of robotics research, at some level doing end-to-end control for robotic systems is a very old idea. The first autonomous driving systems that used end-to-end learning existed in the 1980s. ALVINN was, I think, 1986 or ’87, and that was a driving system, demonstrated to drive on highways, that was controlled by a neural network.
I think that there are some very venerable concepts, but historically what has been really difficult in robotic learning is that you need a system that handles the application you want to address; that is cost-effective to train, meaning you don’t need a huge amount of data for every single application; that handles long-tail scenarios with common sense; and that is robust, fast, and reliable. Getting all those things together is very hard.
Being able to train general purpose models that can handle many tasks is essential because now you need a lot less data for each new task. But then even further, you also need to handle the unusual scenarios. For the unusual scenarios, you are probably not going to have experience. What you need to rely on is knowledge that you’ve acquired from other sources that you can ground in that new situation. And people are extremely good at this. If you’re driving a car and there’s something going on in the middle of the road and someone put up a sign saying don’t go here, there’s a gas leak or something — you’ve probably never experienced that before, but you can put these things together and figure out what you’re supposed to do because you have common sense. And this has been a huge mystery in robotic learning — where do you get that common sense?
And this is what’s changed in the last few years, because it turns out that multimodal language models are really good at pulling in knowledge. They’re not very good at grounding that knowledge in physical situations, but they know stuff. So now there is a path to get that kind of common sense by essentially leveraging the knowledge that is contained in multimodal LLMs. But there’s also a challenge, because you have to somehow plug into that knowledge in the right way.
I think it’s very early right now. Certainly the first end-to-end learning systems in the 80s, that’s definitely a milestone. The first deep reinforcement learning systems in the early 2010s are probably a milestone because deep reinforcement learning gives us a way to go beyond human level performance which will be essential for robotic systems. And then the advent of multimodal LLMs that can be adapted to robotic control to bring in common sense — I think that’s a really important advance.
I started working in robotics in 2014, after finishing my graduate degree, when I started a postdoc with Professor Pieter Abbeel at UC Berkeley. I actually hadn’t worked on robots before. Before that, I worked on computer graphics. The thing I’ve always wanted to figure out is how to get AI systems that get better and better the more they do things. Initially I tried to approach it in a blank-slate way: you start with nothing, practice a particular skill, and get better at it. That was okay in a limited setting but very hard to turn into a general system.
The next thing I tried at Google was to see if we can parallelize it across many robots. Collective learning — put 20 robots in a room and have them all learn together. That works and it generalizes but it’s very hard for that to handle tail cases.
The next step is combining this ability to practice skills with lots of prior knowledge. That’s actually a really hard problem. Arguably the two big impressive results in AI over the last few decades have been generative AI and deep reinforcement learning. Generative AI is impressive because it can reproduce some of the things humans can do. Deep reinforcement learning is impressive for the opposite reason: it does things that humans hadn’t thought of, like AlphaGo’s move 37. So the big challenge is to combine those threads: how to bring in all of that knowledge from generative AI but also go beyond just human-level performance with reinforcement learning.
So what literally have we done? We started by developing the basic foundations: a vision-language-action model. You can think of it as an LLM that has been adapted for robotic control. These models are first trained on text data, then adapted with lots of image data from the web to understand images, and then adapted to robots with lots of very diverse robot data.
From there, we studied two threads: how to get this thing to handle unusual situations with common sense, and how to get it to improve with reinforcement learning. The way you get common sense is by essentially using chain of thought. The robot enters a scene and instead of directly starting to move, it thinks about what it was asked to do. If it was told clean up the kitchen, it looks at the scene and says “okay based on this I should pick up the plate.” It literally talks to itself. That unlocks all this prior knowledge because those intermediate inferences benefit from the web-scale pre-training. And the reinforcement learning part comes in after you’ve practiced it a few times and you can keep getting better through experience.
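The think-before-acting loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `think_then_act`, `toy_reason`, and `toy_act` are hypothetical names, and the two toy functions are hand-written stand-ins for what would in practice be a single learned vision-language-action model, not anything from Physical Intelligence’s actual system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    thought: str        # the intermediate text the robot "says to itself"
    actions: List[str]  # low-level commands conditioned on that thought


def think_then_act(
    instruction: str,
    observation: str,
    reason: Callable[[str, str], str],
    act: Callable[[str, str], List[str]],
) -> Step:
    """First produce a language subtask, then produce actions conditioned on it."""
    thought = reason(instruction, observation)
    actions = act(observation, thought)
    return Step(thought=thought, actions=actions)


# Hypothetical stand-ins for the learned model (assumptions for illustration).
def toy_reason(instruction: str, observation: str) -> str:
    if "clean" in instruction and "plate" in observation:
        return "pick up the plate"
    return "look around"


def toy_act(observation: str, thought: str) -> List[str]:
    target = thought.split()[-1]  # last word of the subtask names the object
    return [f"reach({target})", "close_gripper"]


step = think_then_act(
    "clean up the kitchen", "a plate on the counter", toy_reason, toy_act
)
print(step.thought)
print(step.actions)
```

The point of the structure is that the action step never sees the raw instruction directly; it sees the intermediate inference, which is where web-scale knowledge can enter.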
About sensors — you can actually get away with less than one might think. This platform has three cameras, one on each wrist and a base camera. No touch sensing, no force sensing. Very bare bones and low cost. A good learning method can compensate for deficient sensing fairly well. The wrist cameras are essentially a touch sensor in disguise because you can see local deformations when you touch something.
About the data reservoir — I don’t think anybody really knows how much robot data is needed to have truly generalizable embodied AI, but my sense is that we actually don’t need to know. What we need to do is get to the point where these systems are useful enough that they can go out into the world and gather more data themselves. Tesla doesn’t worry about how much data their cars can collect. The key is to get a system that can go into the world, that’s useful enough, does a wide variety of different things, and can keep pulling in more data.
What has been the most surprising thing? We’ve made a lot more progress on dexterity than I thought we would. What was surprising is that we could get these systems to perform very dextrous behaviors without doing anything particularly special. The same also applied to getting systems to work on different embodiments — we could get our models to work on all sorts of other robots, including robots with multi-fingered hands, different numbers of degrees of freedom. The model itself didn’t need to change. It didn’t even need to be told through any kind of prompt what the robot was.
On Moravec’s paradox — we have a cognitive bias to think that things that are easy for us will be easy for the machine. Solving calculus problems is difficult for most people, but picking up a cup is easy. We think machines should be able to do this, but it’s actually the other way around. We’re very good at spotting the tiger in the jungle because the people that weren’t got eaten. Machine learning slightly changes that equation though. Getting a machine learning system to pick up any cup anywhere, if you have data for it, is actually not that difficult. Increasingly, domains where collecting data is straightforward will fall into the easy bucket over time, even if they are physically intricate.
For the purpose of robotic learning, we can think of common sense as applying semantic inferences using knowledge learned from other domains to the current physical task. You can think of it as the opposite of muscle memory. Common sense is when you know something to be true because you saw it or read about it and now you are in a situation where that fact is highly pertinent and you are able to make that connection, apply it grounded in the environment, and make the right decision.
On long-range tasks — we found maybe about 6 months ago that our models had gotten to the point where they could be improved just from supervising them with high-level instructions. You take a robot, put it in a new kitchen, ask it to clean, it fails somewhere. Traditionally you’d add more teleoperation data. But we tried just adding more data labeled with the semantic command — basically take whatever the robot experienced and label it with some semantic commands but don’t add any more low-level actions. And that actually helps. The bottleneck had shifted from the robot’s ability to physically do the task to this middle level where the system is more bottlenecked by its ability to interpret the scene and select the correct next step, which can be supervised with language. That means someone can literally talk to the robot — coaching, basically — and make it better just by talking to it.
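One way to picture this kind of language-only supervision is as a relabeling pass over logged robot experience. The sketch below is an assumption-laden illustration, not the actual pipeline: `LoggedFrame`, `Annotation`, and `label_frames` are hypothetical names, and images are represented by string IDs. It pairs each logged frame with the semantic command a human assigned to that time window, and adds no new low-level actions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LoggedFrame:
    time_s: float
    image_id: str  # stand-in for a camera image from the robot's own log


@dataclass
class Annotation:
    start_s: float
    end_s: float   # exclusive
    command: str   # semantic command supplied afterwards by a human coach


def label_frames(
    frames: List[LoggedFrame], annotations: List[Annotation]
) -> List[Tuple[str, str]]:
    """Build (image, command) pairs for supervising the high-level step selection."""
    pairs = []
    for frame in frames:
        for ann in annotations:
            if ann.start_s <= frame.time_s < ann.end_s:
                pairs.append((frame.image_id, ann.command))
                break  # at most one command per frame
    return pairs


frames = [LoggedFrame(float(t), f"f{t}") for t in range(4)]
annotations = [
    Annotation(0.0, 2.0, "pick up the plate"),
    Annotation(2.0, 3.0, "put the plate in the sink"),
]
for image_id, command in label_frames(frames, annotations):
    print(image_id, "->", command)
```

Frames outside every annotated window are simply left unlabeled, so the coach only has to describe the parts of the episode that matter.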
On real data versus simulation — it’s a very controversial topic. If you look at humanoids doing acrobatics, there’s a pipeline that is very heavily reliant on simulation and actually very light on real world data, often zero. Then approaches that work well for robotic manipulation are the opposite — very little simulated data, large amounts of real world data and very large foundation models. It is surprising that in these two robotic domains, the dominant approaches look so different.
About the Robot Olympics: Benjie Holson wrote about a robot Olympics centered around everyday tasks that people find easy but robots struggle with. Opening a door, washing a frying pan with grease, using a plastic bag to pick up dog poop. We tried these things and could solve almost all of them. We didn’t get turning a dress shirt inside out, because the grippers wouldn’t fit inside the sleeve. And on a technicality we didn’t succeed at peeling an orange, because the fingers weren’t strong enough, so we had to use a little tool. But everything else we could do. And we didn’t develop anything special for this. We used it as a test of our task onboarding process.
On superhuman abilities — we were working on a task where a robot had to plug in cables. When a person does this, they pause frequently because they have to cognitively process what’s going on. If you’re teleoperating it’s even slower. It turns out to be pretty straightforward to find all those pauses and remove them. You can get a task where a person demonstrates what it means to succeed and then have the robot practice and succeed in the same way but a lot more quickly, a lot more efficiently. The most general way to do this is with reinforcement learning.
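The pause-removal idea can be illustrated with a simple filter over a recorded demonstration. This is a minimal sketch under assumptions that are not in the conversation: the demonstration is a list of joint-position vectors sampled at a fixed rate, a "pause" is a run of steps that move less than `threshold` from the last kept state, and `remove_pauses`, `threshold`, and `max_idle` are hypothetical names. The reinforcement learning approach mentioned as the most general route is not shown here.

```python
import math
from typing import List, Sequence


def step_size(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two joint-position vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def remove_pauses(
    trajectory: List[Sequence[float]],
    threshold: float = 1e-3,
    max_idle: int = 1,
) -> List[List[float]]:
    """Keep moving steps; keep at most `max_idle` consecutive near-still steps."""
    if not trajectory:
        return []
    kept = [list(trajectory[0])]
    idle = 0
    for state in trajectory[1:]:
        if step_size(state, kept[-1]) > threshold:
            kept.append(list(state))  # real motion: always keep
            idle = 0
        elif idle < max_idle:
            kept.append(list(state))  # tolerate a short hold
            idle += 1
        # otherwise: part of a long pause, drop it
    return kept


demo = [[0.0], [0.1], [0.1], [0.1], [0.1], [0.2]]
fast = remove_pauses(demo)
print(len(demo), "->", len(fast))  # prints "6 -> 4"
```

Comparing each state against the last kept state, rather than the previous raw sample, means very slow drift during a pause is also compressed instead of being kept step by step.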
On form factor innovation — in general in robotics the ability to innovate on form factors has been very constrained because of the AI challenge. If you could just put together a robot in your garage, load up a robotic foundation model and tell it to do stuff — maybe it won’t be perfect, maybe it needs more data, but you can at least get the thing moving. I think that can be a really powerful engine to get everybody to experiment.
On physical intelligence and embodiment — there were studies done in monkeys using tools where you can find which neurons activate for the monkey to figure out where its hand is. It turns out that if it’s using a tool, they activate based on the location of the tool tip, not the hand. The tool being an extension of your body is a real physiological thing. Physical intelligence should be at some level agnostic to embodiment. A good foundation model should figure out how to manipulate whatever body it’s controlling. There wasn’t a humanoid problem and a car problem and a bulldozer problem. There was one problem, and if you solve it at full level of generality, that’s really powerful.
On the bitter lesson and controversies in robotics — in the early days the main argument was about whether learning has a place in robotic AI. It took a really long time for the community to internalize that you don’t necessarily need to program in knowledge of physics. There’s still not universal acceptance that end-to-end learning is the right way to go. The bitter lesson says you should not program the machine to think the way you think it should think, but let it learn from data. The best steel man against: if you want something reliable in a really complicated open world setting, you can’t afford not to use what you already know about the physical world.
On compositional learning — a student asked a language model to provide a recipe for how to make a sandwich in International Phonetic Alphabet. IPA only ever appears for individual words in a dictionary. You never see free form text written in IPA. But a good language model will write paragraphs in IPA. That is compositional generalization. You can imagine the same thing in robotics — you’ve learned a repertoire of skills and now you can combine and mix those skills to solve new problems.
On the last tasks robots will achieve — I think changing a child’s diaper will be really really hard. People are extremely good at interacting with other people. Elderly care, taking care of small children — those things are going to be hard and they’re probably going to be harder than people think. The stakes are very high.
On physical analogies and LLMs — people are remarkably good at using physical analogies to understand other situations. You could say “that company has a lot of momentum” — that’s a physical analogy. Richard Feynman talks about analogies regarding subatomic particles — we use the word “spin,” the thing is not really spinning like a spinning top, but those analogies help us make sense of it and actually lead to valid inferences. We are so primed to have physical intelligence that you can use it in everyday speech and when advancing fundamental theoretical physics. I don’t know if LLMs can do that.
On preparing for robotics — a big uncertainty is whether robots will rely more on demonstrations or on reinforcement learning from autonomous data. How somebody should prepare will be pretty different depending on whether they need lots of teleoperation versus a tiny number of demonstrations and huge amounts of autonomous experience.
On robotics and labor — coding tools are a nice template. It’s not like they came on the scene and suddenly we don’t need software engineers. They increase the productivity of individual software engineers. A more realistic template for robotics is not the humanoid goes in and the people leave. It’ll be more like some aspects of the job done by a robot, some with a robot working together with a person. It’ll be this kind of dance that we’ve seen with coding tools.
On hardware costs — when I started working in robotics about a decade ago, the robot cost about $400,000. When I started my lab at UC Berkeley, it was around $30,000. Now each arm on this thing is maybe a tenth of that. And those low-cost arms wouldn’t be useful in an industrial setting with traditional control methods. But with learning-based approaches, they work.
On what we’re working on next — a big focus is better understanding mid-level reasoning. We have a good sense for how to acquire low-level physical behaviors but getting those to generalize requires common sense knowledge. LLMs make certain kinds of representations convenient — turning text into other text — but that’s not necessarily the best representation for what an embodied system needs to do. Sometimes it needs to think about things more spatially, sometimes semantically.
I’m on the optimistic end relative to established robotics researchers and on the pessimistic end relative to robotics entrepreneurs. Robotics has a very long history with precious few successes. Most robots doing useful work are still running state-of-the-art technology from the 1980s. But I can see a lot of the puzzle pieces that could be slotted in to address many of those things. As my co-founder Karol likes to say, when you’ve climbed the mountain, only then do you see if there’s another mountain after it. And in robotics, there’s been a lot of experience of lots of mountains.
On inspiration — I am quite inspired by Boston Dynamics. There is a lot of value in repeatedly showing something that people wouldn’t have thought possible. I’m also inspired by organizations that create an atmosphere for experimentation. OpenAI has historically done a great job of this. ChatGPT was basically John Schulman’s pet experiment for a while — it wasn’t a concerted corporate strategy with lots of spreadsheets and pie charts.
On kindness in his career: when I started at Google, I was absolutely shocked at the level of leverage I felt I could have. We took a couple dozen robots, put them in a lab, and had them collect data: “the arm farm.” I found out somebody had a warehouse full of robots nobody was using, and I asked Jeff Dean if we could stick them in a lab. I was just a level four research scientist. And Jeff was like, “Yeah, let’s do it. What do you need?” I also think that when I started my postdoc with Pieter Abbeel at Berkeley, I had zero robotics experience, and that was a bet on my potential more than my actual accomplishments. These kinds of things really matter in a person’s career.