Accelerating Science with AI

ELI5/TLDR

Kevin Weil, until recently president of OpenAI for Science, walks through what large AI models can now do in actual research labs — not the chatbot stuff, the hard stuff. In three years the same family of models went from flubbing high-school SAT math to autonomously solving open research problems in fields like algebraic topology and particle physics. The pitch is simple: this is not the future where the robots do the work and we sit around. It is the future where a postdoc who used to need ten grad students now has a tireless collaborator that has read every paper ever written, and can spawn ten parallel copies of itself to chase every hunch she has.

The Full Story

The three-year jump nobody quite registers

Weil opens by tracing a single benchmark — AI performance on math competitions — across three years. In 2023, the original GPT-4 scored 700 on the math SAT. Pretty good, better than 90% of US high schoolers. A year later, on the much harder AIME (a top high-school competition), the next model scored 9%. A strong F.

Then, in late 2024, OpenAI released its first “reasoning” model, o1. Same family, but with a twist: instead of blurting out an answer the instant you ask a question, the model is allowed to pause and think. Try different angles. Work backwards. Reduce the problem to something simpler. The thing you or I would do if asked a hard question.

This single change (letting the model spend compute at question time, not just at training time) gave scaling a second knob. o1 hit 74% on the AIME. By mid-2025, an internal OpenAI model won a gold medal at the International Math Olympiad, a result Weil says would put it among the top five human competitors at that contest. By the time of the talk, the models were essentially perfect on the AIME.
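To get a feel for why spending more compute per question can help, here is a toy sketch. It is not how o1 works internally (the talk does not describe that); it swaps in a much simpler technique, majority voting over repeated attempts. The solver, its 40% per-attempt accuracy, and every name below are assumptions made up for illustration.

```python
import random
from collections import Counter


def noisy_solver(correct_answer: int, p_correct: float, rng: random.Random) -> int:
    """Hypothetical stand-in for one model attempt: right with probability p_correct."""
    if rng.random() < p_correct:
        return correct_answer
    # Wrong attempts scatter across nine different wrong answers.
    return correct_answer + rng.randint(1, 9)


def majority_vote(correct_answer: int, p_correct: float, n_samples: int,
                  rng: random.Random) -> int:
    """Spend more question-time compute: sample n_samples attempts, return the most common."""
    votes = Counter(
        noisy_solver(correct_answer, p_correct, rng) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, p_correct: float = 0.4, trials: int = 2000,
             seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is the correct one."""
    rng = random.Random(seed)
    hits = sum(
        majority_vote(42, p_correct, n_samples, rng) == 42 for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 25):
        print(f"{n:>2} attempts per question -> accuracy {accuracy(n):.2f}")
```

The only thing that changes between runs is the number of attempts per question; the underlying "model" stays exactly as unreliable as before. Real reasoning models think in a far richer way than voting, so treat this purely as intuition for the second knob.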

The point is not the scores. The point is the slope. Weil compares it to the first time you ride a Waymo: the first thirty seconds you are clutching the seat, the next two minutes you are amazed, and ten minutes later you are bored on your phone wondering why traffic is so bad. We adjust to the miraculous extremely quickly.

Stochastic parrots, and why they were wrong

For years the standard critique was that language models are “stochastic parrots” — they sample from the distribution of stuff they were trained on, so by construction they cannot do anything genuinely new. Just rearrangements of what they have already seen.

In late 2025, OpenAI co-authored a paper with about ten outside academics to put a flag in the ground: here is where AI actually is in scientific work. Most of it was modest: using AI for literature search, as a brainstorming partner, that sort of thing. But the last chapters had something different. The AI had actually solved an open problem.

Specifically, an Erdős problem. Paul Erdős was a wildly prolific mathematician who lived on coffee, amphetamines, and the couches of every collaborator he could find, and who left behind roughly 1,200 named open problems. About 700 are still unsolved. OpenAI’s mathematicians cracked one of them with the model.

Then a strange thing happened. Within a couple of months, mathematicians all over the world started using the same model to knock off other Erdős problems. Some turned out to have been solved already, hidden in papers nobody had connected. Others were genuinely new. The mathematician Terry Tao made the sharp observation that many of these problems had not been bottlenecked on cleverness — they had been bottlenecked on attention. Nobody had taken the time to look at them with the right tools.

A real research problem, sealed in an envelope

Some skeptics said: fine, low-hanging fruit. So a group of mathematicians ran what they called “First Proof.” They posted ten genuine research-level questions across fields like spectral graph theory, symplectic geometry, and stochastic analysis — questions that had emerged naturally in the authors’ own work, each with a proof under five pages, with the solutions encrypted and locked away for a week. No humans in the loop. AI only.

Bets were placed that nobody would get more than two. OpenAI’s internal model, autonomously, got five.

Weil flags something subtle here: he says “we believe we got five,” not “we got five.” Verifying whether an AI’s proof is actually correct turns out to be its own hard problem — more on that later.

Particle physics, and the term that wasn’t supposed to exist

Weil pivots to physics with a story that lands harder than it should. For decades, particle physicists believed that a particular scattering amplitude (the one with one negative-helicity gluon and the rest positive) was zero. It was in textbooks. The version with two negative-helicity gluons is called “maximally helicity violating” (MHV) precisely because the zero- and one-negative cases were assumed dead, which made two the largest violation on offer.
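For readers who want the textbook statement in symbols, here is a sketch in standard spinor-helicity notation. None of it comes from the talk itself: the notation, the color-ordered tree-level setting, and the Parke-Taylor form of the two-negative case are the usual textbook conventions, written up to overall normalization.

```latex
% Tree-level, color-ordered n-gluon amplitudes, all momenta outgoing.
% Coupling constants and the momentum-conserving delta function are omitted.
\begin{align*}
  A_n(1^+, 2^+, \dots, n^+) &= 0, \\
  A_n(1^-, 2^+, \dots, n^+) &= 0, \\
  A_n^{\mathrm{MHV}}(\dots, i^-, \dots, j^-, \dots)
    &= \frac{\langle i\,j \rangle^{4}}
            {\langle 1\,2 \rangle \langle 2\,3 \rangle \cdots \langle n\,1 \rangle}.
\end{align*}
```

The second line is the "supposed-to-be-zero" case in the story. The third line shows what a closed-form answer for an arbitrary number of particles looks like: one compact expression that holds for every n.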

Andy Strominger, one of the top string theorists alive, suspected there was a region where the supposed-to-be-zero term was not actually zero. But proving it required calculations that grew exponentially nasty as the number of particles increased. The kind of thing that eats grad students for breakfast.

Strominger and his collaborators flew to OpenAI to try the new model on this. On the plane over, one of them started feeding the problem to GPT-5.2 Pro on his laptop. The model guessed a closed-form answer for arbitrary numbers of particles, but could not prove it. The team handed the problem to an unreleased internal model. Before the physicists had even landed, it had proved the formula. They spent the week verifying the AI’s work instead of doing the work themselves.

The other physicist Weil quotes, Nima Arkani-Hamed, made an observation that sticks: AI tends to be very good at finding hidden simplicity inside apparent complexity — and hidden simplicity is usually the fingerprint of an underlying symmetry that nobody had noticed.

Robot labs that pipette while you sleep

The third example moves from theoretical work, which can be done in pure thought, to physical experiments, which involve atoms. A recent paper described an “autonomous lab”: an AI trained on biology, given a goal (in this case, cheap cell-free protein synthesis), and connected to a fully robotic cloud lab.

The model thinks. It designs experiments. It simulates them in its head. When it has a candidate worth trying, it tells the robots what to do. The robots run the experiment, measure the results, and feed the data back. The model updates. Repeat.
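The design, run, measure, update cycle can be sketched in a few lines. This is a toy under loud assumptions: the "lab" is a simulated function with a made-up yield curve, the "model" is plain hill climbing, and every name and number is hypothetical. The actual system in the paper is not described at this level in the talk.

```python
import random


def run_experiment(concentration: float, rng: random.Random) -> float:
    """Simulated robot run: hypothetical yield curve peaking at 0.7, plus measurement noise."""
    return 1.0 - (concentration - 0.7) ** 2 + rng.gauss(0, 0.01)


def closed_loop(n_rounds: int = 200, seed: int = 0):
    """Repeat design -> run -> measure -> update, keeping the best setting seen so far."""
    rng = random.Random(seed)
    best_x = 0.1  # arbitrary starting guess
    best_yield = run_experiment(best_x, rng)
    history = [(best_x, best_yield)]
    for _ in range(n_rounds):
        # Design: propose a candidate near the current best, clamped to [0, 1].
        candidate = min(1.0, max(0.0, best_x + rng.gauss(0, 0.1)))
        # Run and measure: the simulated robots report a noisy yield.
        measured = run_experiment(candidate, rng)
        history.append((candidate, measured))
        # Update: keep the candidate only if it measured better.
        if measured > best_yield:
            best_x, best_yield = candidate, measured
    return best_x, best_yield, history


if __name__ == "__main__":
    x, y, log = closed_loop()
    print(f"best setting {x:.2f}, measured yield {y:.2f}, after {len(log)} runs")
```

Both halves of the loop scale independently: a smarter proposal step stands in for more compute on the AI side, and running many candidates per round stands in for more robotic stations.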

In one such loop they ran 36,000 experiments — far more than any human team. They cut the cost of protein synthesis by 40% while increasing yield by 27%. Crucially, every layer of this scales. More compute on the AI side. More robotic stations on the wet-lab side. The bottleneck of “how many postdocs can you afford to pipette things” simply dissolves.

Weil quotes a Berkeley linguist (who, by the way, is decoding sperm whale vocalizations and has determined that sperm whales have vowels): “AI is a metal detector for hypotheses.” We humans generate way more ideas than we have time to test. An infinitely patient collaborator that has read every paper in your field can sweep across all of them and tell you which ones beep loudest.

What still does not work

Weil insists he is not here to oversell. Three things are genuinely hard.

Verification. When the model spits back “I solved 200 of the 700 open Erdős problems,” it is almost certainly wrong about most of them — but you cannot tell which until a human checks. AI-written proofs can be subtly broken: they cite a lemma that does not quite apply, or invent one outright. The bottleneck has shifted from attention to verification. The proposed fix is formal proof systems like Lean, where a correct proof is mechanically checkable. The catch is that formalizing mathematics from the ground up — defining what a topology is before you can verify a statement about fundamental groups — is itself an enormous undertaking that the math community is still chewing through.
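A tiny illustration of what "mechanically checkable" means, written in Lean 4 using only its core library. The theorem name is made up for this example; `Nat.add_comm` is a real core lemma. Nothing here is drawn from the talk.

```lean
-- Lean accepts this only if the proof term really has the stated type.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An invented lemma does not get through. Uncommenting the next line
-- produces an error, because `Nat.made_up_lemma` does not exist:
-- theorem bogus (a b : Nat) : a + b = a * b := Nat.made_up_lemma a b
```

That second case is exactly the failure mode described above: a cited lemma that was invented outright. A checker rejects it immediately, but only for statements that have already been formalized, which is why the groundwork matters so much.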

Unconventionality. For most consumer uses of AI (“summarize this email”), you want the safe, dead-center, most-probable answer. For genuinely hard research problems, the opposite is true: every conventional angle has already been tried by smart humans. Breakthroughs live in the weird low-probability corners. AI models, by default, are trained to head for the middle of the fairway. Training them to deliberately wander into strange territory is an open problem.
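One crude knob that already exists for this is sampling temperature, sketched below. It is offered as intuition for "most-probable answer" versus "low-probability corners", not as a fix for the open problem: turning temperature up mostly adds noise rather than insight. The three scores are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a conventional idea, an unusual one, and a very unusual one.
logits = [4.0, 2.0, 0.0]
cool = softmax_with_temperature(logits, 0.5)
warm = softmax_with_temperature(logits, 2.0)

if __name__ == "__main__":
    print("temperature 0.5:", [round(p, 3) for p in cool])
    print("temperature 2.0:", [round(p, 3) for p in warm])
```

At low temperature nearly all the probability sits on the conventional option; at high temperature the unusual options get sampled far more often. The hard part Weil is pointing at is making those rare samples good, not merely rare.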

Invention. No model has yet produced something like Grothendieck inventing algebraic geometry, or Einstein inventing general relativity from scratch — discoveries that do not just solve a known problem but spawn whole new fields. Weil thinks it is coming. Not yet.

The line to remember

He ends on a one-liner he wants the audience to walk out with: the AI model you are using today is the worst AI model you will ever use for the rest of your life.

He thinks 2026 will be for AI in science what 2025 was for AI in software engineering: at the start of 2025, using AI for most of your code made you an early adopter; by the end, not using it for most of your code meant you were falling behind. He thinks the same flip is happening right now for scientific work. And taking it further: it is plausible that AI lets us do the science of 2050 in 2030.

Key Takeaways

Reasoning models (o1 and later) added a second scaling axis: compute at question-time, not just training-time. The jump from 9% to 74% on a top math competition came mostly from letting models think.
AI has already solved real open problems in mathematics (Erdős problems) and particle physics (a scattering amplitude long assumed to be zero). Not just plausibly — actually.
Autonomous robotic labs can run tens of thousands of experiments per project. The rate-limiting step in physical science is shifting from human hands to machine throughput.
Expertise still wins. A real physicist plus AI vastly outperforms a non-physicist plus the same model. The model is a power tool; aim still matters.
Verification is the new bottleneck. When a model proposes 200 maybe-correct proofs, somebody has to check them. Formal verification systems (Lean) are the long-term answer for math; other fields will need their own.
AI tends to find hidden simplicity inside apparent complexity, which often points at an unnoticed symmetry. This is exactly the kind of thing humans miss because we get tired and bored.
The frontier is still moving. No plateau visible. Internal models are already more capable than public ones; the people building them see clear paths to further gains.

Claude’s Take

Weil is a good talker and an unusually credible one for this beat — he ran science at OpenAI, and he has most of a physics PhD, so he is fluent in both the model side and the bench-scientist side. The talk does what you want it to: it gives you concrete examples instead of vibes. The Strominger physics story and the Erdős-problem story are both verifiable, both recent, and both genuinely impressive.

The optimism is also slightly suspect, in the specific way that all talks given by senior people at frontier labs are suspect. Notice what is missing: any serious discussion of fraud, of the verification problem at scale (one mathematician quietly going through 200 maybe-proofs is a different beast from a journal getting flooded with AI-generated submissions), of what happens to the training pipeline for human scientists when the early-career grunt work gets automated away. He gestures at the knowledge-gap question when an audience member raises it, and his answer — “this is just abstraction, like calculators or high-level programming languages” — is not wrong, but it is also exactly what someone who works at OpenAI would say.

The strongest part of the talk is also the most concrete: the verification problem. He clearly understands it is the real bottleneck and that the rest of his optimism depends on solving it. The weakest part is the closing — “AI will let us all create anything we want, no excuses” — which is the standard frontier-lab pep talk that does not really survive contact with the actual distribution of human energy and attention.

But strip out the salesmanship and the substance is real. Three years ago this stuff did not work. Now it does, in narrow but expanding domains. That is worth taking seriously.

Score: 8/10. Not novel as a thesis if you have been paying attention, but the specific examples are fresh, the storytelling is clean, and the verification framing is genuinely useful.