
Transcript

2026 Lemley Lecture Featuring AI Pioneer Yann LeCun

That’s great. Wonderful to see a packed room here. Very excited. For those of you I haven’t had the pleasure of meeting, I’m Frank Doyle. I’m the 14th Provost here at Brown University, where I’m also a professor of engineering and a professor of neuroscience. It’s my great pleasure to welcome you to our annual Lemley Lecture, and I’m going to give you just a quick explanation of what the lecture series is. This is the Lemley Family Leadership Lecture Series, and we invite exceptional leaders who are preeminent in their field to come to campus, engage, and inspire the university community. We established this in 2020 through a generous gift made by Wayne C. Lemley, Class of 1980 Brown PhD. Our series features highly accomplished scholars, thought leaders, policy makers, and practitioners who will inform, educate, and challenge, advancing Brown’s commitment to promoting a vibrant intellectual community. And to introduce today’s speaker, I’d like to call up to the podium our Associate Provost for AI, Professor Michael Littman.

[applause] Thanks, everybody. I’m so glad everybody could make it today. I actually didn’t think this was going to work, that you’d see right through this, that Yann is not actually here. As an April Fools’ joke, we just thought it’d be really funny to get everybody to come out here. This is CS 1413, Advanced Prompting for LLMs. All right, so, first slide. No, okay, all right, that was a joke.

Hi, everybody. As Frank said, my name is Michael Littman. He was the 14th Provost, which seems like maybe not enough Provosts given how long Brown has been around, but I’m the first Associate Provost for Artificial Intelligence, which seems like maybe that is enough Associate Provosts for Artificial Intelligence. But I wanted to give everybody a chance to get to know a little bit about Yann before he comes on and tells us about his work. I actually first heard Yann speak in 1988, when he was just one year out of his PhD. In his graduate work, he developed the mathematical underpinnings for backpropagation. That’s the algorithm that actually makes it possible to train these giant neural networks, then known as connectionist systems, that underlie chatbots and thousands of other recent innovations today. Not everybody knows that Yann deserves credit for this, but I do, because that’s what his talk was about in 1988. It was really cool to learn about that in the moment, while this stuff was actually happening in the early days. That was the second wave of neural networks. We’re now in what we could view as the third wave of neural networks. In the years since, I’ve crossed paths with Yann many, many times. I’ve seen him give many invited talks, all of which have been equal parts inspiring and impressive. The inspiring part is that he’s a brilliant researcher and his ideas are incredibly deep and insightful.
He presents new approaches that are just so cool that you have to go off and try them yourself, right? At least when you’re in a room full of computer scientists, everybody wants to run off and ask, is that true? Can I actually try that? The impressive part of his talks is that he is a consummate tinkerer and he often creates live demos to accompany his talks. If you’ve ever done a live demo, it is a very dangerous thing to do, and I’ve seen him do it multiple times. Maybe we’ll see one today? He says no. All right, which I think is the right thing to do, because they can go very badly. Now, if you have followed Yann online, you might have seen that he has very high standards. He will sometimes complain about the status quo, but he’s also very, very generous. Just as often as he points out the flaws that he sees, he also creates solutions. For example, if he’s dissatisfied with a piece of technology, he’ll create a custom-built alternative. Whether it’s his own programming language or his own document and image presentation software, it’s really high-quality stuff. And not only does he create this for his own use because he doesn’t like what’s out there, but he provides it for other people to use as well. Often that’s set up as open software, something for everybody to share and use, because that’s who he is and that’s what he believes. Throughout his career, he has collected a wide variety of distinctions, and I’d like to mention two that might be relevant to the talk, but I’m not sure what the talk is on, so we’ll see. The first is that he is a Turing Award winner. That’s the closest thing that computing has to a Nobel Prize. The award cites his conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.
And that was very, very true when he won the award, and I think it’s even more true today. I was actually lucky enough to be on hand for the ceremony, which was very, very cool. So arguably, winning a Turing Award puts him at the top of academic computer science. But he’s now also atop the business world. As of three weeks ago, he is the founder of a unicorn company. That means that it’s valued at over a billion dollars, and multiple times more than a billion in this case. The company’s called AMI Labs, and maybe we’ll hear a little bit about that. The name of the company stands for advanced machine intelligence, but it’s also the French word for friend. It’s very clever. So we are absolutely delighted that Yann accepted our invitation to speak to you today. I don’t know what he plans to speak about, but I am confident that he will give you something to think about and talk about for days to come. So let’s welcome Yann LeCun. [applause]

Thank you so much, Michael. One thing you have to realize is that I’ve become somewhat famous, or infamous, for bashing a particular (I’m going to go back to the title slide) a particular aspect of machine learning called reinforcement learning, which Michael is actually one of the pillars of. So he’s saying all those nice things about me despite the fact that I’ve been dissing a lot [laughter] of the field he’s been working on, but in a nice way. Yeah. Not worried at all. Yeah. Right, and Michael and I overlapped in New Jersey for a few years, when he was at Bellcore and I was at Bell Labs. They were kind of sister or brother labs. I’m not sure how to say that. Later, when we both joined AT&T Labs, we worked on AI, although in different locations, but in the late ’90s AT&T was probably one of the strongest places anywhere to work on AI.
So it’s not a completely recent phenomenon that industry is a dominant force in AI research. Okay. So this is a session where I’m supposed to give a short talk. I’m going to attempt to do this. I’m not familiar with the concept, but it’s so that we can have a discussion afterwards, a fireside chat and questions and answers. So first of all, thank you for inviting me. Always a pleasure. I’m going to talk about world models and about the next generation of AI systems, and start by showing a slide that you just saw, because Michael inadvertently flipped the slide. It’s “AI sucks.” Michael said that sometimes I have controversial opinions. That’s potentially one of them. So we have AI systems that can write code, they can pass the bar exam, they can win international math olympiads, but where is my domestic robot? Where is my robot that can clean the house, or learn to drive in 20 hours of practice like any teenager? We have systems that can manipulate language, and they fool us into thinking they are smart because they manipulate language. But in fact, they are completely helpless when it comes to the physical world. LLMs in particular cannot really handle high-dimensional continuous data, particularly if it’s noisy. Think video or any kind of sensor data: images, video, audio, sensor inputs from measurements, whatever. Financial data, scientific data. [clears throat] And the result is that everybody in AI these days is talking about agentic systems, that is, systems that can produce actions in the world. And almost none of those systems at the moment are capable of predicting the outcome, the consequences, of their actions. It’s a very bad way to produce an action if you’re not able to predict its consequences. In fact, it might be dangerous.
And so I don’t know for the life of me why people believe that they’re going to be able to build agentic systems by just training them to imitate humans accomplishing tasks, without those systems having the ability to predict the consequences of their actions in advance. And that’s basically what world models are all about. The other question is, why can’t AI systems solve new problems zero-shot? If you ask them to accomplish a task they’ve not been explicitly trained to accomplish, they’re not going to be able to do it. All the robots you see doing amazing things today have been trained to imitate humans with enormous amounts of teleoperation data. Or you see robots doing kung fu dances or whatever, and that’s super simple because the only thing you need to know is the dynamics of the robot. You don’t need to understand how they interact with the world, and that’s where things become complicated. Why can’t AI have common sense and understand the real world? Not like a human, like your cat. Never mind humans. What about cats and dogs? They can do amazing things. They understand the physical world. They have some level of common sense. Certainly dogs. And robots don’t come close to this. As I said, 17-year-olds can learn to drive in a few hours of practice. And we have millions of hours of training data of cars being driven by people. We could use that data to clone the behavior of humans, to train a neural net to basically behave like a human. Yet despite that, we don’t have level five, fully autonomous self-driving cars that are as reliable as humans. We do have autonomous driving cars, but they’re not level five. They can drive themselves pretty reliably, but they can also call home when they’re stuck, their zone of operation is limited, and they use all kinds of sensors. And they required 15 years of very careful engineering to actually do anything. I’m talking about Waymo.
A big success of engineering, but hardly learning to drive in 20 hours of practice. So we’re missing something big in terms of getting machines to learn as efficiently as humans and animals. And we keep bumping into this paradox that Hans Moravec, a roboticist, formulated in 1988. There are things that are easy for humans and difficult for AI, which doesn’t seem natural. The things that seem easy for AI are complicated for humans, like playing chess. But things we don’t even consider intelligent actions, like grabbing an object or manipulating one, we can’t do with AI yet. The real world is messy, and the people who think that by training LLMs, making them bigger, and training them on more language we’re going to reach human intelligence are deluded. Okay? Here’s another controversial statement. Okay? There are literally hundreds of billions invested in an industry that is basically counting on the fact that LLMs are going to reach human-level intelligence. It’s complete BS. It’s useful, right? I mean, it’s not like those hundreds of billions are wasted. But we’re not going to get human intelligence by just scaling up LLMs. It’s just not happening. So, how do humans and animals learn? We learn a lot by observation. Of course, quite a bit by interaction as well. And by imitation, by looking at our parents and other humans doing things. But some animals learn without the help of other members of their species, and certainly not their parents. Think octopus. They never meet their parents. Think cuckoo birds. They never meet their parents, either. So humans and animals learn mental models of the world. And in fact, cognitive psychologists have tried to measure the type of concepts that young humans, infants, are capable of learning in the first few months of life.
So, concepts, very basic concepts like object permanence, the fact that when an object is hidden behind another one, it still exists. We’re not born with that. Babies learn this around the age of two to three months. What’s kind of interesting is that notions of intuitive physics, like conservation of momentum and gravity, take about nine months for a baby to learn. So if you show a six-month-old the scenario at the bottom of the slide here, where a car is put on a platform and is pushed off the platform and appears to float in the air, a six-month-old baby will barely pay attention. Whereas a 10-month-old will look at it with big eyes and re-fixate, because his or her world model is being violated. I co-wrote, I should say, a bit of a vision paper about this with Emmanuel Dupoux, who’s a cognitive scientist who actually put together that chart, and Jitendra Malik, who’s a computer scientist and roboticist at Berkeley, not Amazon. So, how do we reproduce this type of learning, and what is intelligence really, right? There are all kinds of things being said about what it is to really be intelligent. And you would think that, with the success of LLMs, we would know. LLMs are good at sort of interpolating information. If they are trained to answer a question, they will answer that question appropriately. They can solve certain problems because they have some special mechanisms that are built in for when a particular problem is formulated. But they’re not very smart. They can accumulate a huge amount of knowledge. And the reason why they require a large number of parameters and large memory is that they’re based on the accumulation of factual knowledge, essentially. Not understanding. Not much, at least.
So, intelligence, in my opinion, is the ability to accomplish new tasks you’ve never been exposed to and solve new problems without any prior training or with minimal training. Like learning to drive in 20 hours of practice, not millions of hours of examples. So, fast adaptation in the face of new situations is intelligence. And it’s not just adaptation, it’s also creativity and the sort of extensions of adaptation, if you want. Now, the phrase artificial general intelligence, or AGI, makes absolutely no sense. The reason is that it is meant to designate human-level intelligence, but human intelligence is highly specialized. We don’t really realize that because all the problems we can apprehend are problems we can apprehend. But there’s no question that we are very highly specialized. Now, there’s no question, either, that machines will eventually surpass humans in all domains where humans are intelligent. That’s not going to happen next year, despite what Elon Musk has been saying for the last 15 years. It is not going to happen next year. And it’s not going to happen in two years, despite what Dario Amodei has been saying. It’s not going to happen right away. At the very best, we might be convinced within five years that we’re on a good path towards human-level intelligence, but we won’t yet be at human intelligence. It’s going to take a while. And it’s almost certainly much harder than we think, because in the past 70 years of history of AI, it’s always been much harder than we thought. There have been about four or five waves over the course of the history of AI, and Michael and I have been witness to this, of people who said, “That’s it. We’ve discovered the secret of intelligence. Within 10 years, the most intelligent entity on the planet will be a machine.” And that proved to be wrong every single time. It’s going to prove to be wrong this time around, as well.
I co-wrote, really, I didn’t really write it, it was mostly written by the co-authors, this little piece, a philosophy piece, about intelligence and how we should think about human-level or superhuman intelligence. Now, why is it that human-level AI will require real-world data, sensory inputs, as opposed to just language or text? Okay, take an LLM, which is trained on all the publicly available text on the internet. That’s typically something like 10 to the 13 to 10 to the 14 words. In the case of Llama 3, which was a model that my colleagues at Meta trained a couple of years ago, it was about 2 times 10 to the 13 words. You turn this into tokens. A token is a subword unit, so that’s 3 times 10 to the 13 tokens. A token is 3 bytes. So you get roughly 10 to the 14 bytes to train your LLM on, to pre-train it. It would take about 400,000 years for any of us here to read through that text if we read 9 hours a day at 250 words per minute. Okay? A lot of data. Now, take a human child, a 4-year-old, trained on visual sensory data: 16,000 hours of wake time in the first four years, which by the way is a small amount of video data, about 30 minutes of YouTube uploads. We have 2 million optic nerve fibers going to the visual cortex from the retina, each carrying about 1 byte per second. So that’s about 10 to the 14 bytes in the first four years, instead of 400,000 years. There is way more information getting to us through sensory inputs than there will ever be through language. And that tells you we’re never going to get to human-level AI by just training on language. It’s just not going to happen, despite what you might hear. Now, of course, video data is much more redundant than language. But that’s the point. All the machine learning procedures that we use rely on redundancy.
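
The figures quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: every number comes from the talk, and the variable names are made up for illustration.

```python
# Text side: figures quoted in the talk for an LLM trained on public text.
WORDS = 2e13                  # about 2 x 10^13 words
TOKENS = 3e13                 # about 3 x 10^13 subword tokens
BYTES_PER_TOKEN = 3
text_bytes = TOKENS * BYTES_PER_TOKEN

# Time for one person to read that text at the quoted pace.
WORDS_PER_MINUTE = 250
READING_HOURS_PER_DAY = 9
reading_years = WORDS / WORDS_PER_MINUTE / 60 / READING_HOURS_PER_DAY / 365

# Vision side: figures quoted in the talk for a 4-year-old child.
WAKING_HOURS = 16_000
OPTIC_NERVE_FIBERS = 2e6
BYTES_PER_FIBER_PER_SECOND = 1
visual_bytes = WAKING_HOURS * 3600 * OPTIC_NERVE_FIBERS * BYTES_PER_FIBER_PER_SECOND

print(f"text bytes:    {text_bytes:.2e}")
print(f"reading years: {reading_years:,.0f}")
print(f"visual bytes:  {visual_bytes:.2e}")
```

Both byte counts land near 10 to the 14, and the reading time lands near 400,000 years, which matches the orders of magnitude given in the talk.
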
If you try to train a machine learning system with completely random data, good luck. So, what I’ve been working on with some of my colleagues over the last five years, but really over the last 15 years, and recently created a company around, is basically enabling the next AI revolution, which would be AI for the real world, using what’s called self-supervised learning, specifically self-supervised learning for high-dimensional, continuous, noisy data such as images, video, audio, sensors, etc. The goal is to produce AI systems that understand any environment or system that they need to control, including physical things, so they understand the physical world, but also living things. Systems that have persistent memory. Systems that can perform long chains of reasoning and can plan complex action sequences, possibly hierarchically. Systems that can accomplish new tasks zero-shot, without prior training, and adapt quickly to new situations. And finally, systems that are controllable and safe. Current AI systems, LLMs, are intrinsically unsafe. It doesn’t matter too much because they’re not very smart, and we learn pretty quickly how to use them. But they can’t be made completely reliable. Okay, so let me go into the weeds of it. There are two ways to build agentic systems. One is the way LLMs are built. You give them what’s called a context or a prompt, right? They observe something, which is a sequence of symbols in the case of LLMs, and you train them to predict the next symbol. If those sequences of symbols represent a configuration of the world and then an action, they will attempt to predict the action. There’s no reasoning in that. The way you get an LLM, or an autoregressive prediction system, to reason is that you trick it into producing lots and lots of output tokens.
And you hope that because it’s going to spend more computation producing many tokens, it’s going to turn into some form of reasoning. So you train it to go through the steps of a reasoning. That’s what they call reasoning, but it’s not really what reasoning is. Reasoning is a search. When we reason, we search for a solution to a problem. In fact, that’s the way classical reasoning in AI has been formulated since the 1950s. I give you a problem. Whether a particular hypothesis can be a solution to this problem may be characterized by a function. So let’s say your problem is to find the shortest circuit of a traveling delivery truck through cities, the traveling salesman problem. It’s a well-formulated problem. There’s a well-formulated set of possible hypotheses, which is all the possible orders in which you can go through the cities. And then the function you need to minimize is the length of the path. And so you just search for a solution. Now, in 1958, Newell and Simon, pioneers of AI, wrote a program that they very modestly called the General Problem Solver. And it was basically that. If you can characterize a solution to a problem and you have a search space of all the possible solutions, the only thing you need to do is search through that space for one that optimizes the function. They didn’t realize yet, because computer science had not been invented as a field, that most problems are exponential, NP-complete or NP-hard. But this concept, that inference is a search through a set of potential solutions for one that minimizes a particular objective function, is intrinsically more powerful than autoregressive prediction of tokens. Computationally, it’s intrinsically more powerful. You can reduce any computational problem to an optimization problem. You cannot reduce every computational problem, or at least not efficiently, to an autoregressive discrete token prediction problem.
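
The traveling salesman formulation above can be written down almost literally: a set of hypotheses (visiting orders), an objective function (tour length), and a search for the minimizer. The following is a minimal sketch using exhaustive search. The function names and the four-city example are illustrative assumptions, and the number of candidate orders grows factorially, so this only works for a handful of cities.

```python
from itertools import permutations
from math import dist


def tour_length(order, cities):
    """Objective function: length of the closed tour that visits cities in this order."""
    return sum(
        dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def shortest_tour(cities):
    """Inference as search: enumerate every visiting order and keep the shortest."""
    # Fix city 0 as the starting point so rotations of one tour are not re-counted.
    candidates = ((0, *rest) for rest in permutations(range(1, len(cities))))
    return min(candidates, key=lambda order: tour_length(order, cities))


# Hypothetical example: four cities on the corners of a unit square.
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
best = shortest_tour(cities)
print(best, tour_length(best, cities))
```

Nothing in the search depends on how the objective is computed, which is the point made in the talk: once a problem is stated as an objective over a hypothesis space, the same search procedure applies.
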
Okay, so we have those two categories of AI systems. My money is on the second one. Okay, the one that performs inference by search. That, plus a couple of other ideas that I’m going to tell you about, led me to this long-term vision of a cognitive architecture whose centerpiece is a world model. So what is a world model, first of all? A world model is a predictive system. Given the current state of the world, or some abstract representation of the current state of the world, and given an action or intervention that you imagine taking, can you predict the next state of the world that will result from this action or intervention? If you have such a world model that predicts what the world is going to be after you take an action, you can use that for planning. You can feed the resulting predicted state to an objective function that measures to what extent a task has been accomplished. And by optimization, you can search for an action that minimizes this objective according to the prediction of your world model. You can also add guardrail objectives, constraints that the system has to satisfy in its optimization search, that guarantee that whatever actions are being taken, nobody will get hurt or the outcome will be good. Okay? So a system built around this blueprint is intrinsically safe, because it cannot do anything but satisfy your guardrails according to its world model. Now, of course, the guardrails might be inaccurate, the model can be inaccurate. So I’m not saying those models are necessarily perfect, but at least by construction, they will not knowingly produce actions that will produce dangerous results, let’s say.
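
The blueprint just described (predict the outcome of each candidate action with a world model, score it with a task objective, discard anything that violates a guardrail) can be sketched in a few lines. Everything here is a toy assumption made for illustration: the state is a position on a line, the world model is simple addition, and the guardrail is a forbidden interval. It is not the architecture from the talk, only the shape of the planning loop.

```python
def world_model(state, action):
    """Toy dynamics (assumed): the state is a position, the action a displacement."""
    return state + action


def task_cost(predicted_state, goal):
    """Task objective: distance between the predicted state and the goal."""
    return abs(goal - predicted_state)


def satisfies_guardrail(predicted_state, forbidden):
    """Guardrail: the predicted state must lie outside the forbidden interval."""
    low, high = forbidden
    return not (low <= predicted_state <= high)


def plan_action(state, goal, actions, forbidden):
    """Search for the action whose predicted outcome has the lowest task cost,
    considering only actions whose predicted outcome satisfies the guardrail."""
    safe = [a for a in actions if satisfies_guardrail(world_model(state, a), forbidden)]
    if not safe:
        return None  # no action is acceptable according to the world model
    return min(safe, key=lambda a: task_cost(world_model(state, a), goal))


actions = [-2, -1, 0, 1, 2]
chosen = plan_action(state=0, goal=5, actions=actions, forbidden=(2, 3))
print(chosen)
```

The largest step would get closest to the goal, but its predicted outcome falls inside the forbidden interval, so the planner settles for the best action among those the world model predicts to be safe. As the talk notes, this guarantee is only as good as the world model and the guardrails themselves.
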
Now, if you have a world model, the classical way to use it in optimal control, a very old engineering discipline, is that you can run your world model multiple times for multiple actions, a sequence of actions, and then you can plan that sequence of actions to arrive at a particular outcome. This is called MPC, model predictive control. Very classical. Except classically, the model is written by hand. If you want to control an airplane, you can write down the equations of the airplane, the dynamics of the airplane, and that’s your world model. Okay? You know the state of the airplane, you know the equations, you know the control variables, and so you can just do that. In fact, NASA has been using methods of this type to plan the trajectories of rockets since they’ve had their hands on electronic computers, which goes back to the ’60s. Now, really, the way humans and animals perform planning is hierarchical. If I want to plan my return trip from here to New York or, in this example, plan a trip from my office at NYU to Paris, I cannot possibly plan my entire trip at the lowest level of actions, which would be millisecond-by-millisecond muscle control. Not just because it’s too long and complicated, but also because I don’t have the information. I cannot plan the whole sequence of actions in advance. I don’t know what the conditions are going to be when I get down on the street and have to hail a taxi. How long am I going to wait? So all planning is hierarchical. We imagine high-level actions: to go to Paris, I know I have to get to the airport and catch a plane. Now I have a sub-goal, which is going to the airport. How do I go to the airport? I’m in New York, so I can take a taxi or take the subway. Let’s say I decide to take a taxi. How do I take a taxi? I need to go down on the street and hail a taxi, because I’m in New York.
How do I go down on the street? Well, I need to go to the elevator, push the button, and walk out of the building. How do I go to the elevator? I need to stand up from my chair, pick up my bag, open my door, close my door, walk to the elevator, avoid the obstacles, push the button. At some point down in the hierarchy, I’m going to have all the information I need to just act. So I can plan low-level actions, but all of the planning we do is hierarchical. Now, here is a big secret in AI. Nobody knows how to do hierarchical planning. This is an idea. It’s not been tested. We’re working on it. But you should also work on it if you are interested in this question. Okay, so how are we going to train those world models? The natural idea here, since everybody is talking about generative models, is that we should train a generative architecture. All right? Let’s say we want to train a system to predict what’s going to happen in a video, the way LLMs are trained to predict what’s going to happen in a text. You show it a piece of text and you ask it to predict the next words in the text, right? And there are forms of this where you take a text, remove some of the words, and train the system to predict the words that are missing. But an LLM really just predicts future words. Why not do the same with video? Take a video, show a system the first half of the video, and train it to predict the second half. I tried to do this for 15 years. The first 10 years of those 15 were basically a complete failure. And the reason was that I was trying to predict pixels. Now, as it turns out, if you train a system to predict the next word in a text, you can do a decent job. You can never predict exactly which word will follow a sequence of words, but you can produce a probability distribution over all possible words in a dictionary.
There’s only a finite number. So it’s just a big list of numbers between zero and one that sum to one. We know how to do that. But if you train the system to predict what’s going to happen in a video, there’s an infinite number of plausible futures. And most of the information in the video is completely unpredictable. So if I take a video of this room, I point the camera to the left here, I turn the camera slowly, and then I stop and ask the system to predict the rest of the video, it’s going to predict that the camera is probably going to keep panning. There’s absolutely no way it can predict what all of you look like. No way. I mean, it’s impossible, right? At a pixel level. It can’t predict the texture of the ground or the wood panels. It can’t predict that there is some sort of fireplace in the back. Okay, most of the information in the world is not predictable. But it’s very often partially predictable. There is some abstract representation of the state of the world that is predictable. So I can predict that it looks like a pretty full room, and so most of the seats are going to be occupied. I can’t predict what you look like, but a system might be able to predict that there is one person in almost every seat. And it’s probably going to predict that at some point there is a wall, right? Even if it can’t predict the details. And we do this all the time. We predict things that are essentially not predictable. If I take this object, I put it on my hand, and I lift my finger, you can tell what’s going to happen. Okay? If I repeat the experiment, it might fall in a different direction. So you cannot predict exactly, at the pixel level, what’s going to happen. But at an abstract level, you can tell that this object is not stable and is going to fall in one of two possible directions. So what does that mean?
This means that the idea that you’re going to train a system to predict what happens in a video at the pixel level is simply not going to work, because the system is going to spend all of its resources trying to predict things it simply cannot predict. In fact, we did that experiment. Going back 10 years, a bunch of us tried to train a big neural net, I mean, big for the time, to predict a few frames in the future of a video, and we got blurry predictions, because the system predicts the average of all the plausible futures. Some of my colleagues at FAIR also did this with simulated environments. It’s a complete failure. Eventually we figured out a way of making things a little better, using latent variables to take care of the part of the video that is not predictable, but really it was kind of a failure, except for very simple videos. So about five or six years ago, we came up with another idea, which was derived from empirical results on using self-supervised learning to get computer vision systems to understand images. What we discovered was that the system architectures that worked best were not generative architectures, but what we now call joint embedding architectures. Okay, so you see the generative architecture on the left. You show it an X and you ask it to predict a Y, potentially conditioned on an action or something you know is happening. X could be the beginning of a video and Y the continuation, or Y could be an image and X a corrupted version of that image or a different view of the same scene. Okay? And you’re supposed to predict Y from X. And that doesn’t work very well. A bunch of teams at FAIR and other places attempted to train things like variational autoencoders or masked autoencoders to learn representations of images. And the results were okay.
It was super expensive computationally, but it didn’t work very well. What actually works is joint embedding. So, with joint embedding, you take both X and Y and you run them both through encoders. And you make predictions, but you make predictions in representation space. So, the encoder applied to X produces a representation of X, SX, which does not contain all the information about X. And the encoder for Y, which may be identical to the encoder for X, produces a representation of Y, SY, which may not contain all the information about Y. And then you train a predictor to predict the representation of Y from the representation of X. This works really well. There are a lot of self-supervised learning procedures for image and video that actually use this type of architecture, and they work really well. So, there’s a big issue though with this type of architecture, and the big issue is that the system on the right, the joint embedding architecture, can choose to ignore X and Y altogether and minimize the prediction error by just producing constant SX and SY. So, you have to find a way to prevent the system from collapsing. And this is the difficulty of training joint embedding architectures, particularly JEPA. So, what is a JEPA? It’s a joint embedding predictive architecture. It’s a system of the type I just mentioned. You have a variable Y you want to predict and a variable X that you observe. You run them both through encoders, which may or may not be identical, and you predict the representation of Y from the representation of X. You train the system by minimizing the prediction error, but you also have to ensure the system doesn’t collapse, and one good way to do this is to attempt to maximize the amount of information coming out of the encoders. That sounds great. Except we don’t know how to measure information. We certainly don’t know how to measure a lower bound on information content.
We have all kinds of ways to measure approximate information and they are all upper bounds. When you want to maximize something, you need a lower bound so you can push up on it and the actual thing will go up. But we only have upper bounds. Okay? So, what do we do? We come up with good upper bounds and then we cross our fingers. I’m sorry to say that’s what we have. And there are very deep reasons why we can only have upper bounds on information content. We’re never going to find lower bounds, okay? The reason is that to measure information, you have to make some hypotheses about the type of dependencies that you allow your variables to have between them. And what that means is that you’re going to ignore certain types of complex dependencies, which means that the measurement of information you’re going to get is going to be overestimated compared to the actual amount of information, at least if you believe you have objective measures of information, which actually I don’t believe. So, it’s even worse than that. But there are methods to do this, and there is one called SIGReg, which I’ll say a few words about. It’s basically a kind of measure of information that’s differentiable and that you can use to maximize the information content coming out of a neural net or an encoder or any learning machine. And this was, at least in part, invented at Brown. And certainly implemented at Brown by Randall Balestriero together with me. Where are you, Randall? Right here. Okay, if you have any question about this, talk to him. Okay? He’s new faculty in the CS department. And so, I proposed this JEPA architecture several years ago, maybe 5 years ago, roughly. Since then there have been something like 1,300 papers talking about JEPA on Google Scholar. So, this is not just, you know, Yann’s sort of obscure technique anymore.
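The collapse problem described above can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than anything shown in the talk: the encoders are plain linear maps, the predictor is a fixed matrix, and the names `encode` and `jepa_prediction_loss` are hypothetical. It only shows that encoders which ignore their inputs reach zero prediction error while carrying no information.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(W, x):
    """Toy linear encoder: maps inputs (n, d_in) to representations (n, d_rep)."""
    return x @ W

def jepa_prediction_loss(W_x, W_y, P, x, y):
    """Mean squared error between the predicted and actual representation of y."""
    s_x = encode(W_x, x)
    s_y = encode(W_y, y)
    return float(np.mean((s_x @ P - s_y) ** 2))

# x is an observation, y a related target (here simply a noisy copy of x).
x = rng.normal(size=(256, 8))
y = x + 0.1 * rng.normal(size=(256, 8))

W_x = rng.normal(size=(8, 4))
W_y = rng.normal(size=(8, 4))
P = np.eye(4)  # fixed toy predictor

loss_random = jepa_prediction_loss(W_x, W_y, P, x, y)

# Collapsed solution: both encoders ignore their input and output zeros.
W_zero = np.zeros((8, 4))
loss_collapsed = jepa_prediction_loss(W_zero, W_zero, P, x, y)

# A collapsed representation has no spread across samples,
# so it says nothing about x or y even though the loss is perfect.
spread_random = float(encode(W_x, x).std(axis=0).sum())
spread_collapsed = float(encode(W_zero, x).std(axis=0).sum())

print(loss_random, loss_collapsed, spread_random, spread_collapsed)
```

Minimizing the prediction error alone cannot tell the collapsed solution from a useful one, which is why some extra term that keeps the representations informative is needed.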
Okay, so this idea that we need a system to learn abstract representations of observations to be able to make predictions is fundamental to intelligence and certainly fundamental to science. We do this all the time in science. In principle, I could explain everything that is taking place in this room here in terms of, let’s say, quantum field theory. Any theoretical physicists here? Okay, one. Two. I mean, physicists will tell everyone here that, you know, all of your fields are just applied physics at some level, right? Because I can reduce everything to physics. But in fact, we don’t, because it would be completely impractical to simulate our brain processes in sufficient detail to predict, you know, our reactions to this talk by doing quantum field simulation. So, we invent abstractions: particles, atoms, molecules, and in the living world, proteins, organelles, cells, organisms, individuals, societies, ecosystems. And every level of description in this hierarchy allows us to make longer-term predictions than the level below and ignores a lot of details about the level below. That’s the essence of being able to understand the world. There’s a wonderful quote from Albert Einstein, which is that the most incomprehensible thing about the world is that the world is comprehensible. And it could be because there are all those abstractions that we can derive that allow us to predict or explain certain behaviors of the world that would otherwise be way too complicated for us to understand. So, the basic idea of understanding the world and modeling it and being able to make predictions is finding good representations, abstract representations that ignore the details that we cannot predict. This is exactly what JEPA does. And of course, that’s applicable to, you know, AI for science and things like this. So, world models should not be world simulators. They’re approximations. They should not be digital twins.
It’s a very popular phrase at the moment. They certainly shouldn’t be generative models, because you don’t want them to be producing every detail of what you observe. They should not be video generation systems. A lot of people currently use the phrase world model to designate a video generation system. That’s not what I’m talking about. What they should be is action-conditioned predictors in abstract representation space, preferably differentiable. Okay, so let’s say I have a jet engine. A jet engine typically has a thousand sensors in it. Is there a way I can predict the state of the jet engine from the previous state and the intervention I’m making on it? Or let’s say it’s a chemical plant or a power plant. Let’s say it’s a patient. Okay? I have a diabetes patient. Is there a way I can treat the patient with a course of treatment that will, you know, bring the patient to a good state? With, you know, low blood sugar or whatever. So, the point is not to predict every detail, but to predict enough of what is predictable. Ignore the details you cannot predict. Same for a robot. We’re not going to have useful robots until we have systems that basically can understand the physical world, and currently we don’t. You see robots doing kung fu. It’s super simple, because the only thing that involves is the dynamics of the robot itself, and that we can write equations for. But the interaction of the robot with an object for manipulating it, we don’t know how to do that. This idea of a world model is really old. It goes back to the 1950s or 60s. The 50s in the USSR, the 60s in the US, or in the West. It’s the idea that you can use a world model to predict what’s going to happen, and then you can plan a sequence of actions. As I was saying, it’s been used by NASA since the 1960s. I wrote a couple of vision papers about where I see AI research going over the next few years. The one at the top is a long paper, relatively easy to read.
I wrote that one in 2022. And there are various versions of this talk, some slightly more technical than the present one. Yeah, let me go back to this if you want to take a picture. [laughter] Okay. I’m going to use the last few minutes to talk about this SIGReg idea, and a model that Randall called LeJEPA. I have no responsibility in the name. It’s kind of a running joke, because my boss at Bell Labs, Larry Jackel, whom you know, when I was working on convolutional nets back at Bell Labs in the late 80s, he called it LeNet. Okay, and the name stuck. And now I have to carry this “Le” everywhere. So the idea that Randall and I, mostly Randall, worked on is that if you want to maximize the amount of information coming out of the encoder, you pass a bunch of samples through it, and you’re going to get a bunch of points coming out of it. And if you insist that those points form an isotropic Gaussian distribution, the variables are going to be independent of each other. Okay? Because in a joint Gaussian which is isotropic, the individual variables are independent. Right? So if you want to maximize information, you have to make the variables independent of each other, and making the distribution isotropic Gaussian is a good way to do this. And it turns out you can do this differentiably. You can project the joint distribution over multiple axes in this potentially high-dimensional space. And what you get is a bunch of scalar distributions, one-dimensional distributions, and you can differentiably compute the distance between the Gaussian distribution and the empirical distribution for each of those projections. And you can move the points around so that they collectively become jointly Gaussian. And it works really beautifully, and there are various mathematical tricks to do this.
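The projection idea just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation discussed in the talk: the function name `sliced_gaussianity_penalty`, the number of directions, the frequency grid, and the weighting are all hypothetical choices. Each one-dimensional projection is compared with a standard Gaussian through its empirical characteristic function, in the spirit of a characteristic-function goodness-of-fit test. The version below uses NumPy only, so it computes the penalty but does not differentiate it.

```python
import numpy as np

def sliced_gaussianity_penalty(z, num_directions=64, seed=0):
    """Average distance between 1-D projections of z and a standard Gaussian.

    z: array of shape (n_samples, dim) holding encoder outputs.
    Each projection is compared with N(0, 1) through its characteristic
    function, evaluated on a fixed grid of frequencies t.
    """
    rng = np.random.default_rng(seed)
    _, dim = z.shape
    # Random unit directions (the "axes") in representation space.
    directions = rng.normal(size=(dim, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions                                # (n, num_directions)

    t = np.linspace(-3.0, 3.0, 25)                       # frequency grid (assumed)
    phase = proj[:, :, None] * t[None, None, :]          # (n, num_directions, T)
    ecf_real = np.cos(phase).mean(axis=0)                # empirical CF, real part
    ecf_imag = np.sin(phase).mean(axis=0)                # empirical CF, imaginary part
    target = np.exp(-0.5 * t ** 2)                       # CF of N(0, 1), purely real
    weight = np.exp(-0.5 * t ** 2)                       # emphasize low frequencies (assumed)
    sq_err = (ecf_real - target) ** 2 + ecf_imag ** 2
    return float(np.mean(np.sum(sq_err * weight, axis=1)))

rng = np.random.default_rng(1)
gaussian = rng.normal(size=(2000, 16))                   # already isotropic Gaussian
collapsed = np.zeros((2000, 16))                         # constant, collapsed outputs
stretched = gaussian * np.linspace(0.1, 3.0, 16)         # Gaussian but not isotropic

penalty_gaussian = sliced_gaussianity_penalty(gaussian)
penalty_collapsed = sliced_gaussianity_penalty(collapsed)
penalty_stretched = sliced_gaussianity_penalty(stretched)
print(penalty_gaussian, penalty_collapsed, penalty_stretched)
```

In a training setup the same computation would be written in an autodiff framework and added to the prediction loss, so that gradient descent pushes the embeddings toward the isotropic Gaussian and away from the collapsed solution.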
And there’s a visualization that Randall generated that shows, you know, various methods of doing this. So Epps-Pulley is the one he used, at the right, to turn a bunch of original points and move them around so they become jointly Gaussian. And more recently, together with Quentin Le Lidec, who’s a postdoc at NYU, Lucas Maes, who’s at Mila, and Damien Scieur, Randall and I kind of worked on this thing called LeWorldModel, okay, running joke. With it, you know, we can train a world model which is action-conditioned and use it for planning, for various tasks. You know, simple robotics tasks and everything. This needs to be scaled up, but we think we have kind of a good handle on it. I’m not going to bore you with details because I’m running out of time. There are alternative approaches to this which have so far been scaled up more than this SIGReg LeJEPA method. In particular I-JEPA, V-JEPA, and DINOv3, which were done by some of our colleagues at Meta. DINOv3 has lots of applications and is probably the best generic image feature extraction system in existence. But let me skip ahead a little bit, if my computer agrees to, and talk about V-JEPA. So this is a video prediction system which is not trying to predict in time. You take a sequence of frames and you suppress some information about the sequence by blocking some parts of it. Not just the future, but an entire chunk. And then you train one of the JEPA architectures to predict the representation of the full video from the representation of the partially masked video. With, you know, various tricks. And so it’s not using SIGReg or any kind of regularization of this type; it’s using what’s called a distillation method, which I’m not going to go into the details of.
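The masked-prediction setup can be illustrated with a toy sketch. The talk deliberately skips the details of the distillation method, so everything below is an assumption made for illustration: the per-patch `encode` function, the mean-pooled context, the fixed predictor `P`, and the exponential-moving-average `ema_update` (a common form of distillation, in which a slow-moving target encoder provides the prediction targets) are hypothetical stand-ins, not the V-JEPA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": 8 frames, each a 4x4 grid of patches with 6 features per patch.
video = rng.normal(size=(8, 4, 4, 6))

# Hide an entire spatio-temporal chunk, not just the future frames.
mask = np.zeros((8, 4, 4), dtype=bool)
mask[2:6, 1:3, 1:3] = True

def encode(W, patches):
    """Toy per-patch encoder: a linear map followed by tanh."""
    return np.tanh(patches @ W)

W_context = 0.5 * rng.normal(size=(6, 5))  # would be trained by gradient descent (not shown)
W_target = W_context.copy()                # slow-moving copy that receives no gradients
P = np.eye(5)                              # toy predictor

# Context: pooled representation of the visible patches only.
s_visible = encode(W_context, video[~mask])            # (n_visible, 5)
context = s_visible.mean(axis=0)                        # (5,)

# Predict, in representation space, what the hidden chunk looks like.
n_hidden = int(mask.sum())
pred = np.tile(context @ P, (n_hidden, 1))              # same guess for every hidden patch
target = encode(W_target, video[mask])                  # (n_hidden, 5)
loss = float(np.mean((pred - target) ** 2))

def ema_update(W_tgt, W_ctx, momentum=0.99):
    """Move the target encoder a small step toward the context encoder."""
    return momentum * W_tgt + (1.0 - momentum) * W_ctx

W_target = ema_update(W_target, W_context)
print(n_hidden, loss)
```

The point of the sketch is only the shape of the computation: the loss lives in representation space, covers the hidden chunk, and compares against targets produced by a separate, slowly updated encoder.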
And, you know, it’s not entirely satisfactory in terms of the properties of these algorithms, but it scales up and it works really well. And what we have at the end, after training the system on a relatively small amount of data, well, a small amount is like 100 years of video, which basically is a day of YouTube uploads, which is way more than any LLM has been trained on, but that can be trained on like 2,000 GPUs in a few days. So it’s not a gigantic LLM. Those systems actually have learned some level of common sense and intuitive physics. You show them a video, maybe a synthetic video, where something impossible occurs, like someone throws a ball and the ball disappears. Or it turns into a cube. Or it stops in the air. The internal prediction error shoots through the roof when those events occur, because the system knows this just cannot possibly happen. So that’s really interesting. It’s the first time I’ve seen this kind of behavior, and I’m sorry the video somehow is not available, because the video is on the web and I’m not connected. But anyway, let me conclude. So I’m making a few recommendations here, which make me extremely popular among some of my colleagues in machine learning, particularly in Silicon Valley. Okay? Abandon generative models in favor of those JEPAs. Abandon probabilistic modeling. So I didn’t have time to talk about this thing called energy-based models, which is basically the theoretical paradigm with which to understand all of this. Abandon contrastive methods in favor of regularized methods. I didn’t talk about this, but those information maximization methods are basically a type of training for those energy-based models, which are called regularized. And I put the last one just for Michael. Abandon reinforcement learning. I mean, really what I mean is not abandon it, just minimize its use because it’s so damn inefficient.
But you have to use it, you know, you have a choice, right? Okay, right. Okay. Did you just try to Schmidhuber me? No. [laughter] No, you never. No, it’s just that, you know, RL is inevitable in the sense that there are situations where the only way to improve your performance is to try something and see if it works, and then try something else, and if it works better, adopt that. And, you know, RL is basically a sophisticated version of that. But the bulk of the learning, you know, takes place when you learn to represent the world appropriately and make predictions. And then learning a task on top of this should be a very fast process, very efficient, that requires a very small number of interactions, which is what allows us to learn to drive in a few hours of practice. So don’t work on LLMs if you are interested in human-level AI. And in the long run, the plan for the new company I formed is to become basically the main provider of intelligent systems. It’s very modest. Okay? Not ambitious at all. And applications of what we’re developing really abound in all major sectors of the economy and in academic fields. In physical science and biomedical sciences, understanding complex systems at the phenomenological level is really what we need. And then in industry or the economy: health care, manufacturing, automotive, aerospace, defense, transportation, logistics, pharmaceutical, etc., etc. So we’re developing AI systems for the real world as opposed to just manipulating language. And we’re hoping that eventually we will build completely intelligent systems based on those hierarchical JEPAs, and solve a lot of interesting problems. Thank you very much. [applause] Yann, thank you. That was fascinating, and as Michael predicted, you don’t pull punches. That’s terrific, very candid.
So I can’t help but note, as a chemical engineer (do we have other chemical engineers in the audience, by any chance?), that MPC, model predictive control, is the backbone of a senior course that’s taught to all chemical engineers. And you may know that MPC came not out of the academy, but out of the private sector. So the model predictive control algorithm was born at a refinery in Texas in the 60s. And it’s a wonderful example of algorithms and technology emerging from the private sector first, and then the academics do the theorems and the proofs and show stability and robustness and all those kinds of details. I can’t help but feel like we’re in a moment like that again, where the private sector, maybe in this case, is really leading the way. Help me understand where you see that balance between what the private sector brings versus the academy. So, if I count how much time I spent in academia and industry during my career, it’s about half and half. The first 12 or 13 years were in industry, at Bell Labs, AT&T Labs, and NEC Labs for 18 months. Then in 2003 I became a professor at NYU, full-time for about 10 years, and then joined Facebook, as it was called at the time, and was basically half and half between Facebook and NYU for a dozen years. And still now I’m a professor at NYU and I’m also executive chairman of that new company. I don’t have any management role because I’m terrible at that. Not an operational person. So, it’s more like, you know, scientific leadership and strategy, but I like that interaction between the two because the two worlds are complementary. They’re complementary because the motivations for the people in those two environments are different. Right? In academia you want intellectual impact.
In some parts of industry where you have an ambitious research lab, like Bell Labs used to be or FAIR used to be until recently, you also have this ambition of having intellectual impact, not just internal to the company you work for but also external. And in industry there’s a different set of motivations. So, it’s okay when you’re in academia, for example, to work on a technique that you know is a dead end, because it might give you a paper at the next conference or in a journal. And you know you’re not going to pursue this, but you’re going to get that paper out because it’s useful for other people to know, you know, that you can go in that direction, and maybe some other person will have the idea of pushing it forward even if you don’t. In industry you don’t do this. If it doesn’t move you forward on your kind of main path, which is longer term, you just don’t work on it because it’s a waste of time. Okay, so this is the exploratory mode of research, which is very much academic. And then there is the thing where you kind of try to make progress on a long-term goal. Early in my career, when I worked on, you know, convolutional nets and we applied them to character recognition, right, to handwriting recognition, I had no interest in character recognition. This is not what I was interested in. I was interested in building intelligent machines, and that had to be done with machine learning because I didn’t think I was smart enough to just conceive an intelligent machine, right? It had to build itself. [laughter] And then, you know, what better domain than perception? If you can do perception, you can do a lot of things. And character recognition was pretty much the only task, perceptual task, for which you could have data at the time. There was no USB camera or anything.
And so, that was the logical thing to do, and it turned out to be practically useful even though that was not in the plan initially. There was really a very ambitious long-term plan. So, pick a long-term, very ambitious plan and work your way back, and you’re going to make progress. Useful things are going to come out of that. Terrific advice. So, staying on the academic front for a moment, you know, one of the challenges, and you must see this at NYU, in fact, given your vantage, you must see this globally, is how we think about education in this moment. Because there are the questions of how these tools will pervade teaching and pedagogy and the curriculum, but I would argue the more compelling question we should be thinking about is how we train the students who are going out and will enter a world in which there are expectations about competency and skills. What’s your vantage, given your kind of 50/50, you know, dual-hatted perspective here, on how we should think about that? Well, I’m merely a professor, so I don’t know anything about education, but I get this question pretty often, either from students or sometimes from their parents. You know, why should I study? AI systems are going to be doing everything for us, right? Maybe I don’t need to go to college. That’s not true. In fact, the trend at the moment is very difficult to identify, particularly in the US, but the trend that we’ve been seeing over the last decade is that there is more demand for more advanced degrees. Certainly that’s true in computer science. There’s more demand in industry for people with PhDs, in AI in particular, but not just AI, than there was just 10 or 15 years ago. It’s true in the US. It’s much more striking in other countries that didn’t have this tradition. There are a lot of countries in the world where doing a PhD was seen as a complete waste of time unless your purpose was to become a professor.
And this is still true in a lot of domains, but in STEM, there’s a lot of demand from industry for people who have PhDs. Why is that? It’s because more and more the health of the economy and the growth of the economy rely on technological innovation, and technological innovation relies on scientific breakthroughs. And those come about through people who do research. So, research is becoming more and more crucial to economic expansion. One of the main factors in the economic success of the US since World War II has been a very, very strong research ecosystem, you know, with universities and public research. My colleague Matt can tell you more about this because he was at NSF until recently. And this is something which, I’m not happy to say, the US is currently completely destroying, burning to the ground, which is insane. So, there’s more demand for more advanced degrees. And regardless of whether AI is going to help us or not, what is that going to change in the way we work? All of us, instead of doing a lot of low-level tasks that AI can do, are going to be their boss, right? We’re going to tell them what to do. It’s like, you know, all of us, even at a junior level, will be the manager of a team of intelligent agents kind of doing work for us. We’re just going to be managers, essentially. But a good manager knows the low-level technology. So, that’s the first thing: study more, not less. The second thing is, what should you study? Now, technological progress is accelerating. It’s not a new phenomenon. It’s been accelerating, but now it’s accelerating in the sense that the time constants are very short. So, you’re almost certainly going to have to change jobs during your career. Because a new technology is going to come in that is going to make your old job completely different or obsolete. You’re going to have to learn something new.
So, what you should do in college and school and grad school is learn to learn. Or perhaps invent new things yourself, which is what I was saying with the previous point. So, learn to learn. How does that translate into what you need to study? Study things that have a long shelf life and that are very fundamental. The joke I usually tell, which is not a joke despite the fact that it’s April 1st, is this: let’s say you’re studying computer science, and there’s a choice between taking a course on mobile app programming and a course on quantum mechanics. Take quantum mechanics. You might think it’s completely useless to computer scientists, but that’s not true. It turns out that all the mathematics that is used for Bayesian inference is the same as in statistical physics. So, if you want to be able to learn new things quickly, learn the basics, learn fundamental things. Great advice. Terrific. So, maybe delving in a little bit around some of the algorithms you talked about. Again, I’m reflecting back to things like model predictive control and even the applications you spoke of, aircraft engines, the medical problem. These are systems about which we do have some rudimentary knowledge, at least at the appropriate scale, as you correctly noted. So, we can make sort of first-principles or, as a lot of folks refer to them, physics-based models. So, as you and I talked about earlier, there’s this sort of PINNs, physics-informed neural nets, and we have people like George Karniadakis in our applied math division here who pioneered work in that space. Tell me a little bit about how you see intersections between how you bring in causality and constraints versus how a PINN approach might do that.
So, causality. Okay, the bad news is that in physics causality is not a well-established concept, because all microphysics is time reversible, so causality does not exist in the first place, at least until you start talking about thermodynamics or collective phenomena that seem to have a time direction. I think an important point is that there are a small number of systems for which you can write down, you know, a small number of equations that are more or less solvable and that will allow you to predict the behavior of the system, right? So, if you want to predict the behavior of a small molecule interacting with another molecule, you do a simulation with what’s called density functional theory. It basically models the density of electrons, you know, around a molecule and figures out how they stick to each other. It’s very expensive. You can only do it with small molecules, unless you have gigantic clusters of GPUs. You can’t do it to predict, for example, the conformation of a protein. It’s just too complicated. Right? So, for a protein, you need to have a specific model that has been trained from data to produce the right result. And of course, those models incorporate some very deep knowledge about the nature of the problem. Like, you know, AlphaFold and ESMFold have some pretty clear way of representing the conformations of the protein to be able to make the prediction. So, there’s a lot of knowledge that is in there, but it’s a very sort of high-level conceptual knowledge. It’s not very, very detailed, to the point that you can reduce it to a small number of equations. You have to train the model to make the prediction. Most complex phenomena that we observe are of that nature, where you cannot go from the microscopic to the mesoscopic because the underlying process is so complicated.
For example, you know, you might be able to write down a small number of equations that govern the behavior of a neuron in the brain, but that doesn’t mean you understand the brain at the system level. Certainly not the learning mechanisms. So, this idea that, you know, you might be able with AI systems to learn a phenomenological model of a complex system that you cannot reduce to a small number of equations in the traditional reductionist approach, I think it’s very powerful. I’m sure there’s a bunch of materials physicists here. There’s something called the magic angle. You take a monolayer sheet of graphene, okay? A single monolayer, a bunch of carbon atoms arranged in a hexagonal structure. You put another sheet of the same stuff on top of it. You twist it by about 1 degree, 1.1 degrees, and it becomes a superconductor. Why? We don’t have, you know, a reductionist explanation for this. There are theories. There are a lot of people interested in, you know, developing phenomenological models of material properties that may help us discover, you know, new types of batteries and so on. There are some colleagues at Meta that have this really cool project where they’re doing DFT simulation and this kind of stuff, using the supercomputers that Meta has access to, to generate training data for a system to predict the interaction between a molecule of water and a particular material, whatever it is. The idea is that if you could use this model to design a particular material that would facilitate the separation of hydrogen from oxygen in water, you might have solved the energy storage problem. The reason why we can’t just cover a small desert with solar panels to produce electricity is that we need to store the energy somewhere. And separating hydrogen from oxygen in water is a very good way to do that if you can make it efficient.
You can only make it efficient if you have catalysts. You’re a chemist. I’m not. You know, catalysts that can do this, and basically AI can perhaps help us discover some new ones. That’s a really interesting problem. Another problem they’re working on is that you want to make displays for smart glasses so you can display information. You need optical materials with a very high refractive index. There’s one that we know, silicon carbide. It’s too expensive to manufacture. So, what if we can discover a new one that has a high refractive index? They actually kind of made a few proposals for that. Yeah, I mean, I think you’re raising really good points, that the complexity, as you’ve invoked a couple of times, sort of dictates how sophisticated the predictive model needs to be. Simple problems may not need much. Take controlling the temperature of this room. I don’t need a quantum model to predict hours ahead, right? Simple thermodynamics and mixing will do that. So, it’s got to match the complexity and the objective you’re trying to achieve. So, I have one more question for Yann. So, as a warning to the audience, if you’ve got questions, get ready to ask them. We’ll have runners with mics. But, Yann, one of the observations about the moment we’re in right now with AI is that it feels very disruptive, kind of in our face. People are being forced to confront it in a lot of ways. And I wonder if you could comment on this: if AMI is successful, if world models are implemented, on a time scale of, I don’t know, three to five years, will this ever truly be sort of an invisible technology that’s not, you know, in our face and is seamlessly integrated into human tasks, or is it always going to have an interface that feels disruptive? So, the kind of stuff we’re working on at first is going to be, you know, behind the curtain. It’s going to be applied to industrial processes.
You know, situations in industry where you have tons and tons of sensor measurements, and you want to optimally control a process, whether it’s manufacturing, a chemical process, you know, a refinery, a power plant, or maybe the production of a drug of some kind. So, that’s going to be, you know, business-to-business, B2B, not sort of in the face of consumers. Eventually, I think all intelligent systems that users will want to use, particularly wearable assistants, are going to be of the type that I’m describing, and can be used by everyone. So, there is a future in which all of us will be wearing devices like smart glasses and things like that, which will have AI assistants that we can talk to in various ways, either by speech or through what are called EMG interfaces, electromyogram interfaces. So, basically a bracelet that you wear that allows you to point and click and even type or handwrite. And with a display. And we’re going to have those AI assistants with us at all times. And we’re going to be like, you know, a captain of industry or a politician walking around with a staff of helpers. Except those helpers might eventually be smarter than us. I mean, it’s already the case, right? It’s clear that, you know, the staff of politicians are way smarter than they are, right? That is, many politicians. If, right. In many countries, yeah. All right. So, we’re going to open it up to questions from the audience. Again, if you raise your hand, we’ll bring a mic to you. If the runners could… Yeah, sure. In the back there. Please introduce yourself and who you are here at Brown or in the community. Hello. Cool. Hi. My name is Sergio. I’m a PhD student here at Brown University. I’m very interested in, very bullish on, world models for robotic systems, and that’s kind of what my research is all about.
Um so, I’m just going to read off of uh the AMI website, which states, “We can build the future of AI together with industry partners, product developers, and with the global academic research community via open publications and open source.” So, given how industry research institutions in the past have made the same promise to work with the global academic research community only to turn for-profit and become closed source once they believe they can monetize their models, you know, can we as a community actually believe that AMI Labs will be any different? And is there any way that you can ever ensure that?

So, it’s called... It’s pronounced AMI. AMI. AMI, sorry. Uh it’s a French company. I mean, it’s a global company, but the headquarters are in France, and so we pronounce it the French way. Uh but uh no, it’s a very good question. And uh I think the option of being closed or open is kind of bimodal. You can’t do it halfway. You have to be very open or completely closed, right? You have to be, you know, OpenAI went from one to the other. Google has always been somewhere in the middle. Uh FAIR, the fundamental research part of AI research at Meta, uh is slowly going from essentially fully open to not so open. Uh And certainly Anthropic has always been completely opaque, right? So, what we plan to do with AMI Labs is uh I’m a firm believer in the idea that good ideas can come from anywhere, and they come from the interactions between people working on different assumptions, with different motivations, in different environments. And what we’ve seen at FAIR, at least until recently, is that the best way to do exploratory research, including in a university environment, is to host resident PhD students and collaborate with universities. Um because, you know, I was telling you that there are exploratory projects that may or may not have, you know, something to follow.
Uh and there is sort of the main line, the mainstream of where you want to make progress. Permanent staff in an industry research lab tend to work on this mainstream. If you want to do exploratory, it’s very hard to do this with small permanent staff because that’s taking a risk for them. And so you do this with postdocs and PhD students. And so that puts you in a completely different mode. Now, we have empirical data on this. In 2015, I created the Parisian branch of FAIR, FAIR Paris, and shortly thereafter Google DeepMind opened a lab in Paris as well. They didn’t have a lab in Paris. They have a Google lab in Paris, but not DeepMind. Uh FAIR Paris has been an enormous success, an internal success for the company. A lot of things have been produced at FAIR Paris that have a huge impact on the company. Uh one of them is Llama. Llama 1, the first version of Llama, the open-source LLM, was produced by a small team of 12 people in Paris. Okay? It’s not American technology. It’s actually French technology. Um paid by American dollars. Uh And then it was picked up by the rest of the company and developed into Llama 2 and 3 and etc. Uh At any one time, until recently, FAIR Paris had 40 resident PhD students. The PhD in France is 3 years because you do a master’s before. Uh so, FAIR Paris graduates 12 PhD students per year in AI. That completely jump-started the Parisian ecosystem of AI. Two of the co-founders, two of the three co-founders of Mistral, uh which is the, you know, big French AI company, uh did a PhD in residence at FAIR and worked on Llama 1, and then left and started Mistral. Uh And you know, I can give you like a dozen companies of this type which, you know, were created or helped by people who did their PhD there. What happened was that also we were very integrated with the uh academic research community uh in France and just profited from that because of the openness.
You compare that with Google DeepMind in Paris, it was not nearly as much of a success because DeepMind is much more secretive. They don’t like to have students who kind of, you know, see what they’re doing internally and kind of take that uh everywhere else afterwards. And so, because of IP issues, they could not have resident PhD students. As a consequence, their impact on the ecosystem is much, much, much smaller. So, you know, you want to be in that mode where you progress quickly because you collaborate. And that’s the mode we’re going to be in, at least at the fundamental research level. Now, of course there are going to be parts of the company, of AMI Labs, that are going to be more about, you know, technology transfer and product development, and that is not helpful if it’s open.

Great. Thank you for the question. Maybe we can take one up front here, gentleman with the sunglasses on the side. Do we have a mic? Yeah.

Hi. I’m Takis Dimitrakopoulos. I’m a senior from Athens, Greece studying applied mathematics and classics. Thank you very much for your time here today. I wanted to ask you the following. You argued in your presentation that a lot of the LLMs can’t achieve human-level intelligence because they lack, among others, persistent memory, um grounded experiences, the world models that AMI is going to develop. And I wanted to ask, if an LLM eventually passes every benchmark we design, at what point does the distinction between actual cognition, true reasoning, and sophisticated pattern matching um become philosophically meaningless? At what point does that just equate to moving the goalposts?

Okay, so as I said, I think intelligence is not a collection of skills or an accumulation of knowledge, but an ability to solve new problems that you’ve never faced before. Uh let me take a concrete example. Uh there is no way to design a set of questions or problems uh that the next generation of LLM will not be able to solve.
And it’s very simple. It’s like as soon as you come up with a set of questions that LLMs cannot solve, you just integrate them in the training set and then the next generation is going to solve those questions. I’ve been at the receiving end of this multiple times when people ask me like, you know, is there any question that an LLM cannot answer? I said, yeah, like we have an intuitive understanding that if you put an object on the table and we push the table, the object will move with the table, right? That’s very simple physical intuition. So, I said that during a Lex Fridman interview, um and you know, 6 months later someone like asked the question to, you know, GPT-X, and of course GPT, whatever it was, 5, answered the question uh correctly. And so a lot of people on Twitter, you know, haters on Twitter said, oh, Yann LeCun is out of it, like he’s totally stupid. You know, he claims all these things that are false, blah blah blah. Uh but it’s very simple, like, you know, the GPT version in question was trained with that question. That question was publicly available. Uh OpenAI has dozens of people actually kind of listing all of those questions that might be asked. And you know, because I asked that question to Lex Fridman, you had immediately hundreds of people trying that on ChatGPT. And so now OpenAI has the data of people asking this question and of course they’re going to train it to actually answer that question correctly. You know, there’s this example recently, you know, there’s someone that comes up with those questions that LLMs can’t answer properly. He said like, I need to uh wash my car and the car wash is 100 yards from my place, should I walk? And you know, they all say, yes, you should walk because, you know, there’s no point taking your car, and they give you a whole long list of reasons why you should not take your car for 100 yards.
Uh of course those systems don’t have any sort of, you know, physical intuition, right? Which is why they answer it this way. You can be sure that the next generation of LLMs is going to answer this correctly because, you know, that is now part of the training set. Uh so, there’s not going to be any finite test that is going to determine whether a system is intelligent or not because it’s just information retrieval. The big question is, here is a new problem you’ve never seen, can you solve that?

Thank you. Thank you for the question. Yeah. We’ll take one right here, the gentleman in the tie.

Hey. Uh my name is Dominic. I’m a Brown alum. And one thing I was wondering, uh so do you believe there’s any place for large language models in amplifying the abilities of world models similar to how the different specialized regions of the human brain cumulatively amplify human cognitive performance?

Okay, you do have an LLM in your brain. Uh we have a small part of our brain right here behind the ear, it’s about this big, it’s called Wernicke’s area. That’s what allows you to turn language into thoughts. And then there is another one right in front of it called Broca’s area, also about this big, uh that allows you to turn thoughts into speech. You can’t speak if you have a damaged Broca’s area. Uh those things popped up in evolution maybe a million years ago, maybe a bit more. They cannot be that complicated because they’re so recent and so small. You combine that with your hippocampus, which is where you store a lot of uh declarative knowledge, and that’s your LLM. Okay? But your intelligence is this. Okay, we have a good handle on perception. That’s the back of the brain. Like the front of the brain, that’s your world model. That’s what you reason with. Uh and that’s, you know, what we attempt to uh perhaps uh emulate with uh AMI Labs and world models, uh but LLMs just don’t have that.

Thank you.
All right, do we have one further back? Let’s take advantage of the back of the room. Gentleman in the black jacket there with the glasses, yeah.

Um Thanks for the talk, Yann. Um I’m TJ. I’m a PhD candidate from Biostatistics and um I think I have a more concrete question about the world model itself. Like, how do you define the state in the model? Like, how far is that latent state from what we observe? And could it be achieved by, like, a fancy decoder design or a more dedicated loss function design? Like, it’s not something beyond the current um deep learning framework?

Okay, so what is the state? That’s a good question. Uh the state is not an objective property of the phenomenon you’re observing. It’s an abstract representation that serves the purpose of making predictions. And the state that you will use to make predictions in 1 millisecond, with a 1 millisecond time frame, is not the same as the state you’re going to use to make predictions an hour from now. Uh and so what the JEPA or similar model needs to represent by the state depends on uh the horizon of the prediction it’s supposed to do. The longer term prediction you’re making, the fewer details you can keep in the state because there are things you just cannot predict long term if there are too many details, right? You cannot predict the individual trajectories of molecules in this room. Uh that’s just too much computation. Also, measuring the initial state would be very hard, uh, impossible, in fact. Um, and very quickly the simulation will diverge from reality. Uh, but if your purpose is to, you know, predict the pressure in this room as a function of temperature, you can do that. PV equals nRT. You don’t need to simulate every individual molecule to be able to predict that.
Or similarly, if you are an aircraft designer, you want to design the airfoil of the wing of an airplane, you can do computational fluid dynamics. You don’t simulate every molecule bumping into each other and onto the airplane. You model the state of the air in a little cube by velocity, pressure, and temperature, and then you solve Navier-Stokes equations. So, everything we do in prediction requires finding an abstract representation that eliminates all the stuff we cannot predict and allows us to make predictions at the time scale that we are interested in. Uh, and so, the way JEPA is built is that it finds this trade-off between uh, maximizing the information content of the representation, but yet only preserving information you can predict. And then depending on the horizon at which you train the prediction, you will get different levels of uh details in the representation.

Great. Uh, maybe on the row here, the lady with the glasses.

Thank you. I’m Sonia. I study anthropology and computer science. So, an interesting mix when it comes to world models. I wanted to ask you how you approach building teams at AMI Labs and in general people who work on world models because even with LLMs, we talk about lack of diversity, right? And lack of um variables that go into building a good response. World models feels exponentially larger than that. So, when do you stop building your teams representation-wise, but also when do you stop building world models variable representation-wise? Thank you.

So, the world model itself, uh, you know, the type of things it will capture depend on the data it’s trained on. And most of this data is, you know, physical measurements and stuff like that. Of course, you could also imagine training world models that try to, you know, predict uh human behavior. Uh, in fact, we have a name for it. We call it mental world model.
Like if you want the dialogue system to uh, to be useful, like let’s say uh a system that is supposed to, you know, help students learn particular material, you want that system to be able to have some idea of the mental state of the person it’s talking to and then have some way of predicting what is going to be um the mental state of that person once I tell the person a particular piece of information. Is that person going to be able to absorb that information and learn something from it or just completely uh ignore it? Uh, so that’s called mental world model and there, you know, we start to get into like uh um you know, questions of, you know, who can receive what type of information, etc. What, you know, biases are, etc. Uh, I believe there was some other question, uh or implicit question at least, in your question about the composition of the staff at AMI Labs. Yeah, uh What’s the proportion of women in your CS uh class? Say again? 40%. 40%? 40%? That’s great. That’s awesome. Uh The national average is closer to 20%. Probably between 15 and 20. And it’s been the case for quite a number of years. It went down from about 40% in the ’80s to between 15 and 20 uh nowadays. Uh, some universities like Brown, CMU, and others uh you know, make a point of uh trying to kind of balance. Um There is no particular change of proportion at the graduate level. So, the proportion of uh women in PhDs is roughly the same as undergrad. So, there is no leak in the pipeline at that level. Um But there’s clearly a, you know, small number of uh women uh getting into that pipeline nationally, even if it’s, you know, better at Brown. Uh, how do we fix that? Like why is it that over the last few decades we’ve uh reached uh parity in things like medicine, law, and various other domains and not STEM? I don’t have an answer. It’s a challenge.
I’m going to take the liberty, we’re just about out of time, to pose one last question, which kind of feeds in particular the background of a couple of our questioners here that might shed light on what we’re doing differently at Brown here. And you’ve heard these dual interests in humanities and CS, in social sciences and CS. And so, my question for you is um a little bird told me you have a particular passion in music. Uh, [snorts] tell us a little bit about that side of your own duality in marrying humanities and tech.

Yeah, uh since I was uh in uh middle school, as a matter of fact, and high school, I played in a Renaissance uh band. So, I played Renaissance and Baroque music, so voice, wind instruments. Um and uh also played uh Breton folk music. That’s pretty obscure. Um Proto folk? Breton. Okay, Brittany. You know about Brittany, right? So, Brittany is the western part of France. And the tradition there is, it’s a Celtic population, the same origin as Ireland and Wales and uh places like that. Uh, traditionally, people in western Brittany uh didn’t speak French, but spoke a Celtic language called Breton. Uh [snorts] uh similar to Welsh. And uh the musical tradition there is Celtic. Uh, so there is like a very distinct uh culture there and uh I used to play that music and it’s very similar, in fact, to uh Renaissance dances. Uh, the sort of uh not the dances that you would dance in the courts, but the dances that, you know, everybody in the villages would uh dance in the Renaissance, where, you know, the popular uh folk dances are inspired by that, basically uh direct descendants. So, there was kind of uh a connection between those two things and it taught us how to play Renaissance music at the right speed. Almost the only two times you can hear Renaissance music in the US is if you go to a Renaissance fair or at Christmas time. Mhm.
They play fake Renaissance music with modern instruments as if they were Christmas songs and they play them about three times slower than they should. Uh it drives me crazy. Okay. [laughter] But then uh after I came to the US, I was always a fan of jazz and various versions of improvised music and I always thought the intellectual process of improvisation was mysterious to me. And so, I decided that I should learn to do it, to figure out like what’s the deal. So, I tried to, you know, uh play jazz. I’m a terrible performer. Um but um I got interested in this. Uh, but in fact, I would tell you something. I got interested in electronics and uh my undergrad is in EE. And I got interested in electronics and [clears throat] and computer science eventually uh through music. I wanted to produce music with computers. And also, I played with synthesizers. My cousin, who was a little older than me, uh was an aspiring electronic musician in the ’70s. And I knew a bit of electronics when I was in high school, so I would hack his synthesizers so that he could produce new sounds. Uh, and I keep doing this. I have like a whole bunch of synthesizers in my home. I’m a terrible performer again and I don’t play keyboards. Uh, so because I don’t play keyboards, I built myself electronic wind instruments with which I can control my synthesizers. Again, that’s my hobby.

Far out. Wow, what a Renaissance man. Please join me in thanking our wonderful Lemley speaker, Yann LeCun. [applause] Thank you. And in the spirit of his comments about the Celtic language, one of the very few words I know in Gaelic is sláinte, which is uh Gaelic for cheers. So, please join us for a reception out in the lobby there. In fact, in Breton, it’s yec’hed mat. There we go. Okay, before you go away, I’m going to take a selfie, okay? So, don’t move. All right. Ready? Good luck to you. Come down here with Michael. All right. Great.
Thanks again. [applause] Mhm.