2026 Lemley Lecture Featuring AI Pioneer Yann LeCun
ELI5/TLDR
Yann LeCun, one of the three people who basically invented modern AI, stood in a room at Brown and told everyone that the thing everyone is obsessed with, large language models, is a dead end on the road to real intelligence. His argument: by the age of four, a child has taken in about as much raw information through their eyes as every word ever written on the public internet. Language is a thin skin on top of a thick physical world, and you can’t get to a domestic robot that does your laundry by reading more Reddit. What you need instead is a “world model”: a system that watches the world, learns how it behaves, and can imagine what will happen if it takes an action. He calls his version JEPA, and he just started a billion-dollar company to build it.
The Full Story
The opening provocation
LeCun starts by apologising for being, in his words, infamous for “bashing” parts of machine learning. He gestures at the room: AI systems can pass the bar exam, win international math olympiads, and write code. Fine. But where is the robot that cleans your house? Where is the robot that learns to drive in twenty hours of practice, the way a seventeen-year-old does, instead of needing fifteen years of Waymo engineering and a small nation’s worth of sensors? “We have systems that can manipulate language,” he says, “and they fool us into thinking they are smart because they manipulate language.” In the physical world, they are helpless.
The industry’s hundred-billion-dollar bet that bigger LLMs will eventually think like humans is, in his words, “complete BS.” He clarifies that the money isn’t wasted, since LLMs are useful, but the trajectory doesn’t lead where the people writing the checks expect it to.
The data argument
Here is the cleanest part of his case. Imagine a modern large language model trained on essentially every word on the public internet. That is somewhere around ten to the fourteenth bytes of data — if you or I sat down and read it at a brisk pace for nine hours a day, it would take about 400,000 years to finish. A lot of text.
Now imagine a four-year-old. About two million optic nerve fibres in total, roughly a million from each retina, feed the visual system, each carrying roughly one byte per second, for the sixteen thousand hours the child has been awake. Run the arithmetic: that’s also about ten to the fourteenth bytes. The same volume of data, flowing through the eyes of one small person rather than out of the entire written output of human civilisation. And that four-year-old can reason about the physical world in ways no LLM can touch.
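The arithmetic is easy to check. The sketch below plugs in the figures quoted in the talk (two million fibres, one byte per second each, sixteen thousand waking hours, and 400,000 years of reading at nine hours a day). The variable names are made up for illustration, and the reading rate is derived from the quoted numbers rather than stated by LeCun.

```python
# Back-of-envelope check of the figures quoted in the talk.
FIBRES = 2_000_000              # optic nerve fibres, both eyes combined
BYTES_PER_FIBRE_PER_SEC = 1     # LeCun's rough estimate
HOURS_AWAKE = 16_000            # waking hours in the first four years
SECONDS_PER_HOUR = 3_600

visual_bytes = FIBRES * BYTES_PER_FIBRE_PER_SEC * HOURS_AWAKE * SECONDS_PER_HOUR

TEXT_BYTES = 10**14             # approximate size of the text corpus
READ_YEARS = 400_000
READ_HOURS_PER_DAY = 9
reading_seconds = READ_YEARS * 365 * READ_HOURS_PER_DAY * SECONDS_PER_HOUR

# Reading speed implied by the quoted figures (derived, not from the talk).
implied_bytes_per_sec = TEXT_BYTES / reading_seconds

print(f"visual input by age four: {visual_bytes:.2e} bytes")
print(f"implied reading rate: {implied_bytes_per_sec:.1f} bytes/second")
```

The visual total comes out at roughly 1.15 × 10^14 bytes, the same order of magnitude as the text corpus. The implied reading rate of about 21 bytes per second is in the range of a fast human reader, so the two quoted figures are at least consistent with each other.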
The conclusion writes itself: language is a low-bandwidth channel. If you only ever learn from text, you will never see what a baby sees in their first year of life. “We’re never going to get to human-level AI by just training on language. It’s just not happening.”
What we actually mean by “understanding the world”
Babies aren’t born knowing that an object hidden behind another object still exists. That concept — what psychologists call object permanence — shows up around two to three months old. Intuitive physics, the creeping sense that dropped things fall and momentum is conserved, takes about nine months. Show a six-month-old a toy car that floats off the edge of a table and they shrug. Show it to a ten-month-old and their eyes go wide — their mental model of the world has just been violated, and they notice.
That mental model is what LeCun means by a “world model”. It is not a catalogue of facts. It is a predictive engine: given the current state of the world and an action you might take, what happens next? Think of it like a flight simulator running inside your head. When you reach for a coffee cup, your brain is quietly computing where your hand needs to go, what the cup will feel like, what happens if your grip slips. You don’t consciously notice this. You just notice when the simulation gets it wrong — when the cup is heavier than expected and your arm jerks.
Why LLMs can’t just be scaled up
The way an LLM works is straightforward. You feed it a sequence of tokens (roughly, sub-words), and you train it to predict the next one. When people talk about LLMs “reasoning,” what they actually mean is that the model has been trained to produce lots and lots of intermediate tokens — showing its working, essentially — and that extra computation time approximates something like thought. LeCun is not impressed. “That’s not really what reasoning is,” he says. “Reasoning is a search.”
Imagine the travelling salesman problem — find the shortest route through a list of cities. A real reasoner doesn’t just blurt out an answer from pattern-matching. It searches through possibilities, evaluates them against an objective, and picks the one that scores best. This, LeCun argues, is intrinsically more powerful than next-token prediction. You can reduce any computational problem to an optimisation problem, but you can’t efficiently reduce every problem to predicting the next word.
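To make “reasoning is a search” concrete, here is a minimal sketch of the most naive possible searcher for the travelling salesman problem: enumerate every tour, score each one against the objective (total length), and keep the best. It is exhaustive and only practical for a handful of cities; the function names and the four-city example are illustrative assumptions, not anything shown in the talk.

```python
from itertools import permutations
from math import dist

def tour_length(order, cities):
    """Total length of a closed tour that visits cities in the given order."""
    return sum(
        dist(cities[a], cities[b])
        for a, b in zip(order, order[1:] + order[:1])
    )

def shortest_tour(cities):
    """Search every tour starting at city 0 and return the shortest one."""
    start, *rest = range(len(cities))
    candidates = ((start, *p) for p in permutations(rest))
    return min(candidates, key=lambda order: tour_length(order, cities))

# Four cities on the corners of a unit square.
cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
best = shortest_tour(cities)
print(best, tour_length(best, cities))
```

The answer is not recalled or pattern-matched; it falls out of evaluating candidates against an objective. Practical solvers replace the brute-force enumeration with smarter search, but the structure (propose, score, select) stays the same.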
His architecture puts a world model at the centre. The system imagines a sequence of actions, runs them through the model to predict their consequences, checks whether those consequences satisfy the goal, and — crucially — checks whether they violate any “guardrail” constraints (don’t hurt anyone, don’t crash the car). By construction, such a system can’t knowingly do something dangerous, because any dangerous plan fails the guardrail check before the action is ever taken. Engineers have been doing a hand-written version of this since the 1960s — NASA uses it to plan rocket trajectories, and it’s called Model Predictive Control. LeCun’s move is to learn the model from data instead of writing it by hand.
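The planning loop described above can be sketched in a few lines. This is a toy under loud assumptions: the world model is a hand-written one-dimensional cart (position and velocity), the guardrail is a hypothetical speed limit, and the planner brute-forces every action sequence. LeCun’s proposal is to learn the model from data and to optimise rather than enumerate; none of the names below come from his work.

```python
from itertools import product

ACTIONS = (-1, 0, 1)   # accelerations the toy agent may choose
SPEED_LIMIT = 2        # hypothetical guardrail threshold

def world_model(state, action):
    """Predict the next (position, velocity). Hand-written here, learned in LeCun's proposal."""
    position, velocity = state
    velocity += action
    return (position + velocity, velocity)

def is_safe(state):
    """Guardrail: the imagined state must respect the speed limit."""
    _, velocity = state
    return abs(velocity) <= SPEED_LIMIT

def rollout(state, actions):
    """Imagine the states that a sequence of actions would produce."""
    states = []
    for action in actions:
        state = world_model(state, action)
        states.append(state)
    return states

def plan(state, goal, horizon, guardrail=is_safe):
    """Return the lowest-cost action sequence whose imagined states all pass the guardrail."""
    best_actions, best_cost = None, float("inf")
    for actions in product(ACTIONS, repeat=horizon):
        states = rollout(state, actions)
        if not all(guardrail(s) for s in states):
            continue  # rejected in imagination, never executed
        cost = abs(states[-1][0] - goal)  # distance from the goal at the end
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

safe_plan = plan((0, 0), goal=6, horizon=3)
unguarded_plan = plan((0, 0), goal=6, horizon=3, guardrail=lambda s: True)
print(safe_plan, unguarded_plan)
```

Without the guardrail the planner accelerates three times and hits the goal exactly, but it reaches a speed of 3 on the way. With the guardrail, that plan is discarded before anything is executed and the planner settles for stopping one unit short. In Model Predictive Control proper, only the first action of the chosen plan is executed before replanning from the new state.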
The ten-year failure that became JEPA
For fifteen years, LeCun tried to train systems to predict video the same way LLMs predict text. Show the model the first half of a video, ask it to predict the pixels of the second half. For ten of those years, it was a flat-out failure.
Here’s the reason. When you predict the next word in a sentence, there’s only a finite dictionary (maybe fifty thousand possibilities) and you can produce a clean probability distribution over them. But the next frame of a video? Infinite possibilities. If I turn a camera across this room and stop, no system on earth can predict what each of your faces looks like, what the wood grain of the panelling is, where the fireplace sits. Most of that information is fundamentally unpredictable at the pixel level. Systems trained to try anyway produce blurry averaged mush, because they’re spending their entire capacity trying to guess things that cannot be guessed.
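The “blurry mush” has a simple statistical explanation, which the toy below illustrates under an assumption the talk does not spell out: that the predictor is trained with a squared-error loss. If a single pixel is equally likely to turn out black or white, the single guess that minimises squared error is neither of the real outcomes but the grey in between.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally likely futures for one pixel: black (0.0) or white (1.0).
futures = rng.integers(0, 2, size=10_000).astype(float)

# Try every constant guess from 0.00 to 1.00 and score it by mean squared error.
candidates = np.linspace(0.0, 1.0, 101)
mse = [np.mean((futures - guess) ** 2) for guess in candidates]
best_guess = candidates[int(np.argmin(mse))]

print(f"best single guess: {best_guess:.2f}")
```

The best guess lands near 0.5, a grey that never actually occurs. Repeat that across every unpredictable pixel in a frame and the result is the averaged blur described above.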
But — and this is the clever part — most of the world is partly predictable at a higher level. You can’t predict exactly where a falling pen will land, but you can predict it will fall. You can’t predict the texture of the carpet, but you can predict there will be a floor.
So instead of trying to predict pixels, LeCun’s team started predicting abstractions. They built something called a JEPA (Joint Embedding Predictive Architecture), and the idea goes like this. Take the first half of the video, run it through an encoder that strips out all the unpredictable detail and keeps only the abstract gist. Do the same with the second half. Now train the system to predict the abstract gist of the second half from the abstract gist of the first. Throw away the pixels; predict the structure.
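In code, the shape of the idea looks roughly like the sketch below. It is a forward pass only, with random untrained weights, a single shared linear encoder, and made-up dimensions; it is meant to show where the loss is computed (between embeddings, never between pixels), not how V-JEPA is actually built or trained.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, EMBED_DIM = 64, 8   # illustrative sizes, not from the talk

# Random, untrained weights standing in for learned networks.
W_enc = rng.normal(size=(FRAME_DIM, EMBED_DIM)) / np.sqrt(FRAME_DIM)
W_pred = rng.normal(size=(EMBED_DIM, EMBED_DIM)) / np.sqrt(EMBED_DIM)

def encode(clip):
    """Map a flattened clip to a small abstract embedding (the 'gist')."""
    return np.tanh(clip @ W_enc)

def predict(context_embedding):
    """Predict the embedding of the second half from the embedding of the first."""
    return context_embedding @ W_pred

def jepa_loss(first_half, second_half):
    """Prediction error measured in embedding space, not pixel space."""
    target = encode(second_half)
    predicted = predict(encode(first_half))
    return float(np.mean((predicted - target) ** 2))

# A batch of 4 fake clips, each flattened to FRAME_DIM numbers.
first_half = rng.normal(size=(4, FRAME_DIM))
second_half = rng.normal(size=(4, FRAME_DIM))
loss = jepa_loss(first_half, second_half)
print(loss)
```

Training would adjust the encoder and predictor to reduce this loss, which is exactly where the difficulty described in the next section comes from.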
Think of it like this. If I ask you to describe what happens next in a movie scene, you don’t hallucinate every pixel of the next frame. You say, “she walks out the door, it’s probably still raining, she’ll probably look for a taxi.” You compress the world into the shape of its predictable pieces and make your forecast there. That’s JEPA.
The technical wrinkle (stay with it)
There’s a catch. If you just tell the system “predict the gist of the second half from the gist of the first half, and minimise the error,” the system will cheat. It will learn to produce the same boring constant gist for every input — same gist on both sides, zero prediction error, problem solved, task pointless. This is called representation collapse.
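The cheat is easy to reproduce. In this sketch a deliberately degenerate encoder (hypothetical, written by hand rather than learned) ignores its input and always outputs the same vector. The prediction error is exactly zero, and so is the variance of the embeddings, which means they carry no information about the video at all.

```python
import numpy as np

rng = np.random.default_rng(0)
first_half = rng.normal(size=(16, 32))    # 16 fake clips
second_half = rng.normal(size=(16, 32))

def collapsed_encoder(clips):
    """Ignores its input and returns the same embedding for every clip."""
    return np.zeros((clips.shape[0], 4))

def identity_predictor(embedding):
    """With a constant embedding, predicting 'no change' is always right."""
    return embedding

prediction_error = np.mean(
    (identity_predictor(collapsed_encoder(first_half))
     - collapsed_encoder(second_half)) ** 2
)
embedding_variance = collapsed_encoder(first_half).var(axis=0).sum()

print(prediction_error, embedding_variance)
```

A perfect score on the training objective and a useless representation: any workable JEPA training recipe has to rule this solution out.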
The fix is that you also need to force the encoder to keep as much information as possible about the input, so it can’t just flatten everything to a constant. But here is LeCun’s punchline: in mathematics, we only know how to compute upper bounds on information content, not lower bounds. And when you want to push something up, you need a lower bound to push against. “So what do we do? We come up with good upper bounds and then we cross our fingers. I’m sorry to say that’s what we have.” The whole field is held together by clever tricks. A method called SIGReg, partly developed at Brown by his colleague Randall Balestriero, who was in the audience, shapes the encoder’s outputs into something close to an isotropic Gaussian distribution. For such a distribution the individual features are statistically independent, so each one carries its own information.
It sounds hacky because it is hacky. But V-JEPA, a version trained on roughly a hundred years of video (about a day of YouTube uploads), starts to behave in a way no LLM has. Show it a video where a thrown ball suddenly vanishes, or turns into a cube, or freezes in mid-air — the system’s internal prediction error spikes through the roof. It knows this can’t happen. That is the closest thing to physical common sense anyone has produced.
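The “surprise” signal can be illustrated with a deliberately tiny stand-in. The sketch below is not V-JEPA: it tracks a single number (a ball’s position) rather than learned video embeddings, and its “world model” is a hand-written constant-velocity guess. The threshold and function names are assumptions chosen for the example. What it shares with the real system is the mechanism: predict, compare with what actually happened, and flag the moments where the error jumps.

```python
def prediction_errors(positions):
    """Error of a constant-velocity guess for each frame, based on the two frames before it."""
    errors = []
    for prev2, prev1, actual in zip(positions, positions[1:], positions[2:]):
        predicted = prev1 + (prev1 - prev2)
        errors.append(abs(actual - predicted))
    return errors

def surprising_frames(positions, threshold=0.5):
    """Indices of frames whose prediction error exceeds the threshold."""
    return [
        index + 2  # the first prediction is for frame 2
        for index, error in enumerate(prediction_errors(positions))
        if error > threshold
    ]

# A ball moving one unit per frame that freezes in mid-air at frame 5.
trajectory = [0.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0, 4.0]
flagged = surprising_frames(trajectory)
print(flagged)
```

Only the frame where the ball stops is flagged. Once it has stayed still for a frame, the simple model expects it to keep staying still, so the error drops back to zero.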
Hierarchical planning and the honest admission
Real planning is layered. When LeCun says he’s planning to go from his office in New York to Paris, he is not thinking about millisecond-by-millisecond muscle contractions. He thinks “catch a plane” → “go to the airport” → “take a taxi” → “go down to the street” → “get in the elevator” → then, finally, the specific hand movements. Humans do this reflexively. Getting a machine to do it is, he says, one of the great unsolved problems. “Nobody knows how to do hierarchical planning. This is an idea. It’s not been tested. We’re working on it. But you should also work on it if you are interested in this question.” A rare thing: a Turing Award winner telling a room of PhD students that he hasn’t figured it out and would they please help.
The company and the wider pitch
Three weeks before the talk, LeCun launched a new company — AMI Labs (pronounced “ah-mee,” French for “friend,” also short for Advanced Machine Intelligence). It’s valued at north of a billion dollars. The plan, stated with deadpan modesty: “to become basically the main provider of intelligent systems. It’s very modest. Okay? Not ambitious at all.” The early customers will be industrial — chemical plants, jet engines, power plants, places with thousands of sensors where predicting the system’s behaviour matters more than chatting about it. Eventually, he thinks, the same architecture will be inside your smart glasses and your domestic robot.
The Q&A — a few good exchanges
On whether LLMs will just eventually pass every benchmark anyway: A student asks whether the distinction between real cognition and sophisticated pattern-matching becomes philosophically meaningless once an LLM passes everything you throw at it. LeCun’s answer is sharp. There is no finite test you can design that the next LLM won’t solve, because whatever question you ask today gets scraped into tomorrow’s training set. When he gave the example on Lex Fridman’s podcast of a physical intuition question LLMs couldn’t answer, within six months OpenAI had trained its model on exactly that question. “It’s just information retrieval.” The real test, he insists, is whether the system can solve a problem it has never seen.
On whether industry will ever stay open: A PhD student calls out that every big AI lab eventually closes up. LeCun agrees that open versus closed tends to be a binary — you can’t be half-open — and commits, at least at the research layer, to an open model. He points to FAIR Paris under his watch hosting forty resident PhD students at any one time, which seeded the whole Parisian AI ecosystem and produced Llama 1 (built by twelve people in Paris, he notes, “American dollars, French technology”). DeepMind’s Paris office, he says, couldn’t do the same because of IP paranoia, and their ecosystem impact was much smaller. The implicit argument: openness is a strategy, not a charity.
On what students should study: The question is about whether AI makes a degree pointless. LeCun’s answer is the opposite — more industry demand for PhDs than ever, especially in STEM, because technological progress increasingly depends on scientific breakthroughs. His concrete advice: pick courses with long shelf lives. Between mobile app programming and quantum mechanics, take quantum mechanics. The math of Bayesian inference is the same as statistical physics. Learn the fundamentals. Learn to learn.
On music: A surprise. LeCun played Renaissance and Baroque music in his teens, then Breton folk music (Brittany — western France, Celtic roots, similar to Welsh), then jazz, then built himself electronic wind instruments to control his synthesiser collection because he can’t play keyboards. He got into electronics through music. Then into computer science. Then, eventually, into the architecture of intelligence.
Key Takeaways
- LLMs manipulate language. They do not understand the world. Passing the bar exam and failing to predict that an object falls off a table are not contradictions — they are symptoms of a model trained on the wrong substrate.
- Sensory data dwarfs text. By age four, a child has already taken in about as much raw information through their eyes as every word on the public internet contains. Text is a low-bandwidth slice of reality.
- Reasoning is search, not chatter. The “chain of thought” style of LLM reasoning is a trick that approximates computation by producing many tokens. Real reasoning is optimisation — searching through a space of possible solutions against an objective.
- World models predict abstractions, not pixels. JEPA’s insight is that most of the world’s detail is unpredictable and should be thrown away. Learn the gist, predict the gist.
- Hierarchical planning is unsolved. Humans plan by layering abstract actions over concrete ones. No AI can do this yet. LeCun is openly asking for help.
- Safety comes from architecture, not alignment. A world model with explicit guardrail objectives cannot knowingly produce a dangerous action, because every action is checked before it is taken. This is structurally safer than “trust the LLM.”
- Study fundamentals. Quantum mechanics over mobile app programming. Technology half-lives are short; mathematical structure outlives any specific framework.
Claude’s Take
LeCun is being a little too confident and a little too right at the same time, and it’s worth separating the two.
The “right” part is hard to argue with. The data-volume argument is devastating when you sit with it: one child’s visual stream really does match the entire public text corpus within about four years, and there is no clean path from more Reddit to a robot that folds laundry. His diagnosis of LLM “reasoning” as next-token production dressed up in longer outputs lands cleanly, and the JEPA insight, that you should predict abstractions rather than pixels, is a genuine conceptual step forward. The fact that V-JEPA produces surprise signals when physics breaks in a video is, to my eye, a more interesting result than most of what has come out of frontier LLM labs in the past year. And the architectural safety argument is underrated: building guardrails into the planning loop is structurally different from hoping a trained model stays polite.
The “too confident” part is that LeCun has been saying LLMs will hit a wall for several years now, and the wall keeps receding. He predicted in 2022 that LLMs would plateau; they didn’t. He predicted they couldn’t do math; they won math olympiads. His fallback — “you just trained the model on that question” — is technically true but starts to feel like a moving goalpost of his own. If the next generation of LLMs answers every physical-intuition question you can think of, at what point does the distinction between “understanding” and “sophisticated pattern-matching over an enormous training set” become a distinction without a difference? The student who asked him this basically nailed the tension and he sidestepped it.
The honest read is that LeCun is almost certainly right about one thing — current LLMs are not a complete architecture, and something that understands the physical world through something like JEPA will eventually be part of the stack. Whether that means LLMs are a dead-end branch, or whether they become one component of a bigger system alongside world models, is where I’d bet he is overstating. The hundreds of billions being poured into LLM labs aren’t wasted, and they probably aren’t a cul-de-sac either — they are one leg of a larger body that hasn’t grown yet. LeCun is building the other leg. He’s just louder about it than most.
The meta-point: watch what this man does, not what he says. He is genuinely open-sourcing, genuinely training PhDs, genuinely admitting hierarchical planning is unsolved, and genuinely building the alternative architecture in public. Score 8. Worth the hour and forty minutes.
Further Reading
- Yann LeCun, “A Path Towards Autonomous Machine Intelligence” (2022) — the long vision paper he references. Relatively accessible for a vision paper.
- Emmanuel Dupoux — cognitive scientist who put together the infant cognition chart LeCun showed. His work on what babies learn when is the empirical grounding for the world-model argument.
- Hans Moravec’s paradox (1988) — things easy for humans are hard for AI, things hard for humans are easy for AI. Still the cleanest framing of the whole field’s mismatch.
- Newell and Simon, “General Problem Solver” (1958) — the original formulation of reasoning as search through a solution space. LeCun’s world-model planning is a direct descendant.
- AlphaFold and ESMFold — both referenced as examples of learned models that incorporate deep domain knowledge without being fully reducible to equations. Good case studies for where AI-for-science is going.