DeepMind's New AI Found A Strange New Way To Think

ELI5/TLDR

DeepMind built an AI system that cracked nine math problems that nobody had solved in over 50 years — for a couple hundred dollars each. The clever part isn’t a smarter AI. It’s that they took an AI that lies constantly and wrapped it in a referee system that catches the lies. The AI fails thousands of times, and a judge slowly sorts the slightly-less-wrong attempts from the very-wrong ones until one actually works. A reliable answer, built out of an unreliable machine.

The Full Story

The headline that sounds like a failure

The system is called AlphaProof Nexus. It was pointed at roughly 350 of the open problems left behind by Paul Erdős — a famous mathematician who scattered over a thousand unsolved puzzles into the world. It solved nine. That’s a 95.7% failure rate.

The narrator’s point is that the failure rate is the wrong thing to stare at. These nine problems had gone unsolved by every human for decades. Getting any of them, for a few hundred dollars apiece, is the remarkable bit.

“We went from can’t even add numbers to solving decades-old open problems in the span of four years.”

That’s the framing: four years ago these models couldn’t reliably add two numbers; now they’re chipping at half-century-old puzzles. The trajectory matters more than today’s score.

The honesty problem, and the fix

Here’s the core obstacle. If you ask a normal AI assistant to prove something, it will often just make things up — confident, fluent, and wrong. Mathematicians call that hallucination. You cannot build trustworthy mathematics on a machine that bluffs.

The fix, which isn’t new, is to make the AI write its proofs in Lean — a formal math language where a computer can mechanically check whether a proof is actually valid. Think of Lean like a spell-checker that can’t be fooled: either the proof compiles as correct, or it doesn’t. No charm, no benefit of the doubt.

Everyone uses Lean now. So what’s actually new here is the loop around the AI.

The tournament — the “strange new way to think”

This is the genuinely clever idea, so go slow.

A mathematician writes the problem in Lean and leaves the proof blank. The AI tries to fill it in. It fails — too hard. A second AI critiques the attempt and explains why it’s bad.

Then comes the trick. A cheaper judge AI reads two failed attempts side by side and picks the one that’s slightly better. Both can be wrong. It just decides which is less wrong.

Now imagine that as a chess tournament. Each attempted proof is a player, and every player gets a rating — an ELO score, the same kind chess uses to rank skill. Bad proofs lose to slightly-less-bad proofs and sink down the rankings; the better ones climb.

“We start out from the highest scoring bad solution. So, this is now a tournament. Do this over and over again.”

You don’t restart from scratch each time. You take the best of the bad attempts and try again from there, over and over, the rankings nudging the system uphill. Eventually the Lean checker — the one referee who can’t be fooled — says yes, this one’s correct. Done.

“A reliable system built out of unreliable parts. I love that.”

The bigger shift

The narrator’s larger claim: the era where you got better results only by making the model itself smarter may be ending. Now you get better results by building a tighter harness around it — a good judge and many, many attempts.

“The intelligence is not just in the model, but it is in the loop around it.”

Where it cracks

To his credit, he lists the limits. The 350 problems were a hand-picked subset — likely the ones easiest to translate into Lean — out of the full ~1,200. And smaller, cheaper models solved exactly zero. You still need an expensive, powerful model at the core; the loop alone won’t save a weak engine.

Key Takeaways

AlphaProof Nexus (DeepMind) solved 9 of ~350 attempted Erdős problems — a 95.7% failure rate, but on problems unsolved by humans for up to 56 years, at a few hundred dollars each.
Proofs are written in Lean, a formal language a computer can mechanically verify — this kills hallucination, since a wrong proof simply won’t pass the checker. This part is standard practice now, not new.
The novel piece is a tournament loop: a cheap judge AI compares two failed attempts and ranks them with an ELO-style score; the system restarts from the best “bad” attempt and iterates until the verifier confirms a valid proof.
Core idea: a reliable system built from unreliable parts — the model can bluff freely, but the loop and the unfoolable Lean checker extract a correct answer.
Reframed thesis: progress is shifting from “make the model smarter” to “make the harness/loop around it tighter.”
Limitation 1: selection bias — the 350 were likely the subset easiest to formalize, out of ~1,200 total Erdős problems.
Limitation 2: smaller/cheaper models solved zero — a powerful (expensive) model is still required at the core.
Open trade-off raised: a bigger model with fewer tournament rounds vs. a smaller model with more rounds, at equal cost.

Claude’s Take

The underlying result is real and the central idea is genuinely elegant. “Reliable output from an unreliable model plus an unfoolable verifier” is one of the more important patterns in current AI, and the ELO-tournament framing makes it click. Wrapping a hallucinating generator in a formal checker and a ranking loop is how you get trustworthy math out of a bluffing machine. That part is not hype.

Now the deflation. The channel’s whole shtick is relentless enthusiasm — “hold on to your papers,” “what a time to be alive” — and the “can’t even add numbers… can’t even solve 50-year-old problems” riff is rhetorical sleight of hand designed to make any skepticism sound foolish in advance. Solving 9 of 350 cherry-picked, easy-to-formalize problems is a real milestone, but it is not “AI is now doing original mathematics.” The hardest, messiest Erdős problems — the ones that resist being written in Lean at all — are exactly the ones not attempted. And the honest admission that small models scored zero quietly punctures the “it’s all in the loop now” narrative: the loop matters, but only bolted onto an expensive frontier model.

Score 7: a clear, well-told explanation of a genuinely clever method, with limitations stated rather than buried — but it’s a short news-segment riff on someone else’s paper, wrapped in a sales pitch, not a deep treatment. Watch it for the tournament idea; mentally subtract the exclamation points.