What rebuilding AlphaGo teaches us about self-play, RL, and future of LLMs - Eric Jang

ELI5/TLDR

Eric Jang spent his sabbatical rebuilding AlphaGo from scratch (the system that beat the world’s best Go player in 2016) and got a strong bot running for about ten thousand dollars of cloud compute, work that originally cost DeepMind millions. Across two and a half hours he walks Dwarkesh through how the algorithm actually works: a small neural network learns to “glance” at a Go board and guess both who’s winning and which moves are worth considering, and a tree search uses those glances to look ahead without exploring the whole impossibly large game tree. The deep idea is that ten layers of a neural network can compress a search whose full tree has more leaves than there are atoms in the universe, and once you see that, lots of “intractable” problems start looking less intractable. The conversation ends with why this kind of clean, self-improving training loop is so much more elegant than how language models are currently trained, and what would have to change for that gap to close.

The Full Story

Why anyone should care about a ten-year-old game-playing AI

The setup is simple. Eric Jang used to run AI at a humanoid robotics startup, and before that did robotics research at Google DeepMind. He took a few months off and chose to rebuild AlphaGo instead of going to the beach. The reason matters: AlphaGo was the first system that showed a deep neural network could solve a problem everyone had said was impossible. Go has up to three hundred and sixty-one possible moves at any point and games run roughly three hundred moves deep, so a full search tree has more leaves than there are atoms in the observable universe. For decades, computer scientists wrote it off as a problem this century could not crack. Then in 2016 a ten-layer network did exactly that.
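The size comparison can be checked with a few lines of arithmetic. This is a rough sketch: the branching factor and depth are the round numbers from the text, and the figure of about 10^80 atoms is a common order-of-magnitude estimate that is assumed here, not taken from the conversation.

```python
import math

BRANCHING = 361    # upper bound on moves per turn (from the text)
DEPTH = 300        # rough game length in moves (from the text)
ATOMS_LOG10 = 80   # assumed order-of-magnitude estimate for the observable universe

# Work in log10 so the numbers stay readable: leaves = BRANCHING ** DEPTH.
leaves_log10 = DEPTH * math.log10(BRANCHING)

print(f"leaves in the full tree: about 10^{leaves_log10:.0f}")
print(f"atoms in the universe:   about 10^{ATOMS_LOG10}")
```

Under these assumptions the tree is larger than the atom count by several hundred orders of magnitude, which is why exhaustive search is not an option.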

Jang’s point throughout the conversation is that we still do not really understand why this works, and the answer matters for everything downstream — from protein folding to language models to whatever comes next.

“A ten-layer neural network pass, basically ten steps of reasoning, is able to amortize and approximate to very high fidelity a nearly intractable search problem. This was a breakthrough that I think most people don’t even fully comprehend today, how profound that accomplishment is.”

The game itself, briefly

Go is two players, black and white, taking turns placing stones on a nineteen-by-nineteen grid. You capture by surrounding. You score by counting territory. The game ends when both players pass or one resigns. The interesting thing is that there is no local reward — you cannot tell from any single move whether you’re winning. You only know at the end. That property is what makes Go hard, and it is also what makes the AlphaGo solution interesting.

There is a clean rule variant called Tromp-Taylor scoring that resolves the game with zero ambiguity, which is what computers train against. Humans use a slightly fuzzier version where both players have to agree the game is done.
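To make the "zero ambiguity" point concrete, here is a minimal sketch of Tromp-Taylor area scoring: each player scores their stones plus any empty region that touches only their color. The board encoding (`B`, `W`, `.`) and the function name are illustrative choices, and komi is left out.

```python
def tromp_taylor_score(board):
    """Score a finished position. board: list of equal-length strings of 'B', 'W', '.'.
    Returns (black_points, white_points), without komi."""
    rows, cols = len(board), len(board[0])
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1          # every stone counts for its owner
            elif (r, c) not in seen:
                # Flood-fill this empty region and record which colors border it.
                size, borders, stack = 0, set(), [(r, c)]
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    size += 1
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            neighbor = board[ny][nx]
                            if neighbor == ".":
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    stack.append((ny, nx))
                            else:
                                borders.add(neighbor)
                # The region scores only if it reaches exactly one color.
                if len(borders) == 1:
                    score[borders.pop()] += size
    return score["B"], score["W"]


example = [
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
]
print(tromp_taylor_score(example))
```

In the example the middle column touches both colors and so counts for nobody. No agreement between players is needed, which is what makes this variant convenient for training.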

The naive way: search the whole tree

Imagine you could afford to look at every possible future from the current board, all the way to the end. You would simply pick the move whose subtree contains the most winning leaves. This is the whole game, solved. The problem is you cannot afford it. The tree is too big.

So computer scientists invented Monte Carlo tree search, or MCTS. Instead of building the whole tree, you grow it iteratively, focusing your attention on the branches that look most promising. The key data structure is a node that stores three things: how many times you’ve visited it, the average chance of winning if you go down this path, and a list of children.
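The node described above might look like the following sketch. The field names are illustrative, and storing a running sum instead of the average itself is a common implementation convenience that is assumed here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    visits: int = 0                                 # how many times the search passed through
    value_sum: float = 0.0                          # sum of win estimates seen below this node
    children: dict = field(default_factory=dict)    # move -> Node

    @property
    def mean_value(self) -> float:
        """Average chance of winning along this path (0.0 if never visited)."""
        return self.value_sum / self.visits if self.visits else 0.0


node = Node()
node.visits, node.value_sum = 4, 3.0
print(node.mean_value)
```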

The algorithm that decides which child to visit next is called PUCT. Think of it as a tug-of-war between two instincts. One instinct says “go where I’ve already seen good results” — that’s the exploit term. The other says “go where I haven’t looked enough yet, just to be safe” — that’s the explore term. Early in the search, the explore term dominates and you spread out. As you accumulate visits, the exploit term takes over and you focus on the best path.
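The tug-of-war can be written as a single score per child. The sketch below uses the form published for AlphaGo Zero, in which the explore term is also scaled by a prior probability for the move (in AlphaGo that prior comes from the policy network). The constant `c_puct = 1.5` is an arbitrary illustrative value.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Exploit term plus explore term for one child of a node."""
    exploit = q                                   # average result seen so far
    explore = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return exploit + explore


# Early in the search an unvisited child can outrank a decent, well-visited one.
print(puct_score(q=0.0, prior=0.1, parent_visits=100, child_visits=0))
print(puct_score(q=0.6, prior=0.5, parent_visits=100, child_visits=50))

# Later, with many visits, the explore bonus shrinks and the better average wins.
print(puct_score(q=0.6, prior=0.5, parent_visits=10_000, child_visits=9_000))
print(puct_score(q=0.3, prior=0.1, parent_visits=10_000, child_visits=100))
```

The explore term shrinks as a child collects visits, so the balance shifts toward the exploit term over time, matching the behavior described above.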

This was the state of the art before AlphaGo. It worked, but not well enough to beat strong human players. The breadth and depth of the tree were both too much.

The trick: replace the bottom of the tree with a guess

AlphaGo’s contribution was to train two neural networks that act like the intuition humans use when they look at a board. Think of how a grandmaster glances at a position and says “I’m losing this.” She is not playing out the next hundred moves in her head. She is running a pattern-matcher trained over a lifetime of games. AlphaGo learns to do exactly that.

There are two networks, often combined into one with two heads:

A value network that takes a board and predicts the probability of winning from this position. Just a number between zero and one.
A policy network that takes a board and gives a probability distribution over good moves. Three hundred and sixty-one numbers that sum to one.
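The shapes of the two outputs can be shown without a real network. In this sketch the raw head outputs are random placeholders, not trained values. It only shows how a softmax turns 361 scores into a distribution and how a sigmoid turns one score into a win probability. The count of 361 follows the text and leaves out a pass move.

```python
import math
import random

BOARD_POINTS = 19 * 19   # 361 intersections


def softmax(logits):
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(0)
policy_logits = [rng.gauss(0.0, 1.0) for _ in range(BOARD_POINTS)]  # placeholder policy head output
value_logit = 0.4                                                   # placeholder value head output

policy = softmax(policy_logits)          # 361 numbers that sum to one
win_probability = sigmoid(value_logit)   # one number between zero and one

print(len(policy), round(sum(policy), 6), round(win_probability, 3))
```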

The architecture itself is not particularly important. Jang tried ResNets and Transformers and reports that for small training budgets, ResNets (think: image-style convolutional networks with local pattern detectors) win. This is because Go positions have strong local structure — what’s happening in one corner mostly stays in that corner. Transformers shine when you need global reasoning over long context, but at the size of board Go uses, the local-bias of convolutions is just more efficient per unit of compute.

These two networks let you do MCTS with massive shortcuts. The value network truncates the depth — you no longer need to play to the end of the game to score a position, you just ask the network. The policy network truncates the breadth — instead of considering all three hundred and sixty-one moves at each node, you focus on the handful the policy thinks are worth considering. The intractable search becomes tractable.

“We take the idea that humans can glance at a board and instantly predict whether we win. That maybe gives us the opportunity to truncate how deep we search.”

The four-step dance: select, expand, evaluate, back up

Every move during a real game, the AI runs anywhere from a few hundred to tens of thousands of simulations. Each simulation is one trip down the tree, and it follows a four-step recipe.

Select. Start at the current board and pick the best child according to the PUCT formula. Walk down. Pick again. Keep walking until you hit a node that’s never been expanded.

Expand. When you arrive at a fresh leaf, ask the policy network “what moves look good from here?” Use those probabilities to create new children. This is where the tree grows.

Evaluate. Ask the value network “how likely am I to win from this leaf?” That number is your guess for this branch.

Back up. Walk back up the path you came down, updating each ancestor’s running average to include this new estimate. Each node now has a slightly better sense of how good it is.

Run this loop a few thousand times and you end up with a tree that has a few favorite paths visited heavily, with everything else trimmed. The final move you actually play is the one with the highest visit count from the root. Then — and this part still feels strange the first time you hear it — you throw the whole tree away and start over on the opponent’s next move.
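The four steps fit in a short loop. The sketch below is not Jang's code. To stay runnable it swaps Go for a toy game (one pile of stones, take one to three per turn, whoever takes the last stone wins) and swaps the trained network for a hypothetical `stub_network` that returns uniform priors and a coin-flip value. Values are stored from the point of view of the player to move, so they are flipped at each level.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float
    visits: int = 0
    value_sum: float = 0.0          # win probability summed for the player to move here
    children: dict = field(default_factory=dict)

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.5


def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]


def stub_network(stones):
    """Hypothetical stand-in for the policy and value heads."""
    moves = legal_moves(stones)
    return {m: 1.0 / len(moves) for m in moves}, 0.5


def puct(parent, child, c_puct):
    # The child's q belongs to the opponent, so the parent sees 1 - q.
    return (1 - child.q()) + c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)


def simulate(root, stones, c_puct=1.5):
    path, node = [root], root
    # 1. Select: walk down by PUCT until reaching a node with no children.
    while node.children:
        parent = node
        move, node = max(parent.children.items(), key=lambda kv: puct(parent, kv[1], c_puct))
        stones -= move
        path.append(node)
    if stones == 0:
        value = 0.0                 # the player to move already lost: no stones left
    else:
        # 2. Expand with policy priors, 3. Evaluate with the value estimate.
        priors, value = stub_network(stones)
        node.children = {m: Node(prior=p) for m, p in priors.items()}
    # 4. Back up, flipping the point of view at each level.
    for visited in reversed(path):
        visited.visits += 1
        visited.value_sum += value
        value = 1 - value


def choose_move(stones, simulations=2000):
    root = Node(prior=1.0)
    for _ in range(simulations):
        simulate(root, stones)
    # Play the most-visited child of the root, then discard the tree.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]


print(choose_move(5))
```

With five stones the search should settle on taking one, which leaves the opponent a losing pile of four. Because the stub value is uninformative, all of the signal here comes from reaching terminal positions. A trained value network is what lets the same loop work when the end of the game is far out of reach.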

The recursive magic: train yourself on your own search

Here is where the system becomes self-improving, and the explanation is worth slowing down for.

Before search, the policy network has some guess about the next move. After running a thousand simulations of MCTS, you have a much better, sharper guess. The MCTS output is essentially a better version of what the policy was trying to say.

So you train the policy to imitate the MCTS output. Why have the search do all that work every time when the network could learn to just predict the answer directly? Each round of training, the network starts a little smarter. Then you run MCTS on top of the smarter network, which gives an even better output. Train again. And so on.

This is the heart of AlphaGo’s self-improvement. The search acts as a teacher, the network as the student. The student gradually absorbs what the teacher knew, freeing the teacher to discover new things on top.
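One way to picture the teacher-student step is to turn the root's visit counts into a target distribution and measure how far the policy is from it with cross-entropy. This is a simplified sketch with made-up numbers for three candidate moves. The published AlphaGo Zero recipe also applies a temperature to the visit counts, which is omitted here.

```python
import math


def visit_distribution(visit_counts):
    """Normalize MCTS visit counts into a training target."""
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}


def cross_entropy(target, predicted, eps=1e-12):
    """Loss the policy would minimize when imitating the search."""
    return -sum(p * math.log(predicted[m] + eps) for m, p in target.items() if p > 0)


# Hypothetical numbers: search strongly prefers move A after 1000 simulations.
visits = {"A": 900, "B": 80, "C": 20}
target = visit_distribution(visits)

raw_policy = {"A": 0.40, "B": 0.35, "C": 0.25}       # before training on the search output
trained_policy = {"A": 0.85, "B": 0.10, "C": 0.05}   # after moving toward the target

print(cross_entropy(target, raw_policy))
print(cross_entropy(target, trained_policy))
```

The loss falls as the policy moves toward what the search found, and the next round of search starts from that sharper prior.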

“Instead of having MCTS do all this legwork to arrive here, why don’t you just predict that from the get-go?”

Jang draws a picture. Imagine a plot where the x-axis is how many simulations you run at test time and the y-axis is your win rate. Without distillation, the curve starts low and climbs as you add compute. After distillation, the curve has been shifted up — the network starts where it used to need a thousand simulations to reach. Spend another thousand simulations on top of that and you reach a new, higher ceiling.

Why AlphaGo is more elegant than how we train language models

This is where the conversation gets interesting, and where Jang’s complaint about modern LLM training crystallizes.

When you train a language model with reinforcement learning today — say, on a coding task — the model writes out a whole answer, you check if it works, and you give it a single reward at the end. If the answer is right, you nudge the model toward producing that whole trajectory. If it’s wrong, you nudge it away. Andrej Karpathy memorably called this “sucking supervision through a straw.” All the information from a long, complex task gets compressed into a single yes-or-no signal.

AlphaGo does something fundamentally different. On every single move in every single game, MCTS gives you a strictly better label than the one your policy was about to produce. You are not waiting for the end of the game to find out which moves were good. You are getting move-by-move improvement signals, every move. This means the variance is dramatically lower and the model learns much faster per unit of data.

Jang’s most striking observation about this comes when he tries to explain why LLM training is even worse than the obvious “sparse reward” complaint suggests. Think about what happens when you train a language model from scratch with reinforcement learning. Vocabulary is around a hundred thousand tokens. You ask the model to complete “The sky is…” and it spits out “halcyon” or “told” or any random word. Almost every guess is wrong. The chance of accidentally producing “blue” before any training has happened is roughly one in a hundred thousand. So almost every sample gives you zero learning.

In supervised learning, by contrast, you tell the model “the answer is ‘blue’” and it can immediately compute how far off it was. The amount of information you learn per sample is much higher.

There is a beautiful way to formalize this. If your “pass rate” — the chance your policy happens to produce the right answer — is very low, supervised learning gives you a lot of bits per sample (technically, negative log of the pass rate). Reinforcement learning gives you the entropy of a binary random variable, which goes to zero at both extremes. So when your pass rate is one in a hundred thousand, supervised learning teaches the model a lot per attempt, and RL teaches it almost nothing. And you spend most of training in exactly the regime where RL is at its weakest.
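The two quantities in that comparison are easy to compute. The sketch below follows the formalization as summarized here: supervised learning is credited with the negative log of the pass rate, and RL with the entropy of a pass-or-fail outcome. The function names are illustrative.

```python
import math


def supervised_bits(pass_rate):
    """Bits learned from being told the right answer: -log2(pass rate)."""
    return -math.log2(pass_rate)


def rl_bits(pass_rate):
    """Entropy in bits of a binary pass/fail reward."""
    if pass_rate in (0.0, 1.0):
        return 0.0
    fail_rate = 1.0 - pass_rate
    return -(pass_rate * math.log2(pass_rate) + fail_rate * math.log2(fail_rate))


for p in (1e-5, 1e-2, 0.5, 0.99):
    print(f"pass rate {p:>7}: supervised {supervised_bits(p):6.2f} bits, RL {rl_bits(p):.5f} bits")
```

At a pass rate of one in a hundred thousand, the supervised figure is between 16 and 17 bits per sample while the RL figure is a small fraction of a bit. The RL figure peaks at one bit when the pass rate is one half and falls back toward zero as the pass rate approaches one.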

“It’s a depressing plot in the sense that once you’re here, it’s not at all obvious how you get to there.”

AlphaGo sidesteps this entirely. It never has to live in the cold zero-percent-success-rate regime. It always has an improved label to learn from. Every step of the training is supervised learning on a slightly better target. There is no exploration crisis to escape.

Why the AlphaGo trick doesn’t easily port to language models

If MCTS is so much better, why don’t we use it for language models? Several reasons.

In Go, the breadth is bounded (three hundred and sixty-one moves) and the depth is bounded (three hundred moves) and the value at the bottom of the tree is decidable (you can play out the game and just count). For language, every “move” is a token, the vocabulary is huge, and there is no clean way to score a partial answer. You cannot easily truncate the search the way the value network does for Go.

Worse, in language you almost never visit the same child twice within a single search, because almost every sequence of tokens is unique. The PUCT exploration formula relies on counting how often you’ve taken an action, and that count rarely exceeds one in a language model setting. So the heuristic that makes MCTS work doesn’t transfer cleanly.

Jang thinks something LLM-flavored will eventually come back to forward search — particularly in mathematical reasoning, where logical proofs have more of a tree shape than open-ended conversation. But he is not betting that today’s MCTS algorithm will be that something.

Self-play and how it relates to AlphaStar, Dota, and other games

In games where you cannot easily simulate a search — StarCraft, say, where you do not have a clean model of the game dynamics — there is an alternative trick called neural fictitious self-play. You fix one player, train another to beat them using standard model-free reinforcement learning, then add that new player to a league of opponents and repeat. Over many iterations the league grows stronger.

This is what powered AlphaStar and OpenAI Five. It’s the same fundamental idea as AlphaGo — relabeling your training data with better actions — but the source of those better actions is “beat a fixed opponent” rather than “search the tree.”
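The league loop can be illustrated on a game small enough to solve by hand. This sketch makes two loud substitutions. Rock-paper-scissors stands in for StarCraft, and an exact best response against the whole league stands in for training a new player with model-free reinforcement learning. It is classical fictitious play, shown only to convey the shape of the loop, not how AlphaStar or OpenAI Five were implemented.

```python
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a draw."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def best_response(league):
    """Stand-in for 'train a new player to beat the fixed opponents'."""
    return max(BEATS, key=lambda move: sum(payoff(move, opponent) for opponent in league))


league = ["rock"]                          # start with a single fixed player
for _ in range(300):
    league.append(best_response(league))   # add the new player and repeat

counts = {move: league.count(move) for move in BEATS}
print(counts)
```

Each newcomer exploits whatever the league currently over-plays, so no single strategy can dominate for long. This mirrors the relabeling idea in the text: the better action comes from beating fixed opponents instead of searching a tree.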

Why off-policy training actually helps in AlphaGo

There is a beautiful tangent in the middle of the conversation. Most modern RL practitioners are obsessed with staying “on-policy” — only training on data the current model just produced. AlphaGo casually breaks this rule. Its replay buffer is full of games played by older versions of itself.

The justification connects to a robotics algorithm called DAgger, short for Dataset Aggregation. The idea is that you want your model to know two things: how to act in the situations it’s likely to encounter, and how to recover when something goes wrong and it ends up in an unfamiliar situation. A self-driving car needs to know how to drive straight, but also how to correct when a gust of wind nudges it off the lane.

A replay buffer of slightly-off-distribution states from old games serves exactly that purpose. As long as those states are not too far from where the current policy would actually visit, they teach the model robustness. Push it too far — train on states the policy would never reach — and you waste capacity.
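A bounded buffer is one simple way to get states that are old but not too old. The sketch below is an assumption about how such a buffer could be built, not a description of AlphaGo's or Jang's implementation. Positions are tagged with the policy version that produced them, and the oldest ones fall out once capacity is reached.

```python
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entries when full
        self.positions = deque(maxlen=capacity)

    def add_game(self, positions, policy_version):
        for position in positions:
            self.positions.append((policy_version, position))

    def sample(self, batch_size, rng=random):
        pool = list(self.positions)
        return rng.sample(pool, min(batch_size, len(pool)))


buffer = ReplayBuffer(capacity=100)
for version in range(3):                       # three generations of self-play
    game = [f"position-{i}" for i in range(50)]
    buffer.add_game(game, policy_version=version)

batch = buffer.sample(32, random.Random(0))
print(sorted({version for version, _ in buffer.positions}))
```

The capacity acts as the knob for how far off-policy the training data is allowed to drift. A small buffer stays close to the current policy, while a large one keeps more recovery states at the risk of training on positions the policy would no longer reach.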

“There was a funny quote about chess and Go. The problem with Go and chess is that the other player is always trying to do some shit.”

The bit about NP-hard problems and the smell of P versus NP

Roughly halfway through, the two of them stumble into territory that should make any computer scientist a little uncomfortable. Go belongs to a complexity class that is considered intractable. So does protein folding. So does the search for faster matrix multiplication algorithms (the problem AlphaTensor cracked). And yet neural networks keep finding fast approximate solutions to these supposedly hard problems.

Jang is careful — this is not a proof of P equals NP. But it does suggest that the way we think about NP-hardness, as a worst-case statement, may not be how nature actually behaves. Most real problems we care about have structure. And structure is exactly what neural networks are good at picking up.

The analogy he likes is weather prediction. We cannot predict the exact position of every molecule of air a week from now — that’s chaotic. But we can predict where the hurricane will land. The macroscopic structure is stable even when the microscopic detail is not. Maybe a lot of “intractable” problems are like that, and neural networks are just very good at capturing macrostructure.

The bit about LLMs writing research code

Jang did most of the engineering for this project by directing Claude. He has opinions about what current models are good at and what they are not.

What works: a model can hill-climb a metric. Give it a fixed dataset, a fixed budget, and a clear target, and it will try a hundred small tweaks — different optimizers, different augmentations, different layer counts — and squeeze out real performance gains. This is a much richer form of hyperparameter search than the grid-search era allowed.

What doesn’t work: lateral thinking. Knowing when a research direction is exhausted and you should jump tracks entirely. Knowing whether a discrepancy is a bug or a real result. The kind of senior-researcher taste that says “this whole approach is wrong, let’s go back to first principles.” For that, Jang still had to be the human in the loop, asking the right question at the right moment.

“Often I had to catch infra bugs myself by prompting the right question to Claude to investigate what’s causing the discrepancy.”

He suspects this is fixable with better training environments. Go itself is one such environment — quick to verify, but rich enough to contain real research-engineering challenges underneath. You could imagine training models to be better automated researchers by having them iterate on Go projects.

The closing reflection

The deepest thing in the conversation is not a technique. It’s the recognition that AlphaGo solved something that should not have been solvable and we still do not fully understand why. A small network, very few parameters, almost no test-time compute compared to a frontier LLM, beating a problem with more positions than atoms in the universe.

If that’s possible, then “intractable” is a much fuzzier concept than computer science textbooks suggest. And the techniques that make it work — distilling search into a forward pass, bootstrapping from a value function, relabeling your own actions with a better teacher — may turn out to be more important to the future of AI than the specific algorithms we use today.

Key Takeaways

AlphaGo combines a tree search (Monte Carlo tree search) with two neural networks: a value network that guesses the probability of winning from any board, and a policy network that guesses which moves are worth trying.
The networks truncate the search in both directions — the value network removes the need to play games to the end, the policy network removes the need to consider every move.
The self-improvement loop is elegant: run MCTS, get a better answer than the raw policy network would have given, train the network to imitate that better answer, repeat.
Compared to LLM-style reinforcement learning, this is dramatically more efficient. Every move gets a better label. You never wait for an end-of-game reward. You never have to escape the zero-success-rate regime.
LLMs in their current form cannot easily use MCTS because their action space is too wide and there is no clean way to score partial outputs.
Off-policy training is not always bad. A replay buffer of slightly-off-distribution states teaches the model how to recover from mistakes — like a self-driving car learning to correct when wind pushes it off lane.
A ten-thousand-dollar sabbatical project can now match what cost DeepMind millions in 2016. Architecture choices, replay buffers, and most KataGo tricks turned out to be less important than just having faster GPUs and a good initialization.
The fact that NP-hard problems like Go and protein folding keep falling to neural networks suggests that worst-case complexity is the wrong frame. Real problems have structure, and that structure compresses into a forward pass.
Claude 4.6 and 4.7 are good at hill-climbing a fixed metric but bad at the lateral thinking that says “this whole direction is wrong.” Closing that gap is what would unlock real automated research.

Claude’s Take

This is one of the cleanest pedagogical pieces on a famous algorithm I’ve encountered. Jang teaches AlphaGo the way the best graduate-school professors teach — building up the naive solution first, showing why it fails, then introducing each fix one at a time so you can see why it has to be there.

The single most useful insight in the whole conversation, for anyone trying to understand modern AI, is the chart of “bits learned per sample as a function of pass rate.” It is the cleanest articulation I’ve seen of why current LLM reinforcement learning is so inefficient and why it tends to plateau in odd ways. The fact that you spend almost all of training in the regime where RL gives you almost zero signal per attempt — that’s the kind of observation that changes how you read the next year of papers from frontier labs.

The discussion of MCTS-versus-LLM-RL also clears up something that has been quietly nagging at the field for two years. People keep asking why we can’t just “do MCTS on top of an LLM” the way DeepMind did for Go. Jang’s answer is precise: the search heuristic that makes MCTS work in Go assumes a small discrete action space and a clean value signal at the leaves. Neither of those holds for language. Something else will eventually come back to claim that territory, but it won’t be PUCT applied to next-token prediction.

The weakest part of the episode is the segment on whether AI can fully automate AI research. Both speakers gesture at the right questions — verifiability of inner-loop signals, stackability of local improvements, the role of lateral thinking — but neither lands on much beyond “this is hard, we’ll see.” That’s honest, but it’s the part of the conversation where you can feel them reaching for an answer neither has.

The off-hand observation that NP-hard problems keep falling to neural networks because real instances have structure is the kind of throwaway line that I’d like to see someone write a real paper on. It’s correct, it’s important, and it’s not a thing the formal complexity theory community has fully metabolized yet.

Score: 9 out of 10. Loses a point only because the last twenty minutes meander into automated-research speculation that doesn’t quite earn its airtime. Everything before that is genuinely some of the best teaching on this topic available anywhere.