There Will Be a Scientific Theory of Deep Learning

ELI5/TLDR

Right now we build AI by trial and error. We tweak knobs, run training, see what happens, tweak again. There is no underlying theory that tells us why one setting works and another fails. Two researchers argue that’s about to change. They think deep learning is finally ready for its own version of physics: a set of clean, mathematical laws describing how learning actually moves through a model. They lay out five lines of evidence that such a theory is within reach, and they sketch the first bricks of the building.

The Full Story

Kanjun Qiu, the CEO of Imbue, sits down with Jamie Simon and Daniel Kunin, two young theorists who just published a perspective paper with twelve other co-authors. The paper has a bold title and a bolder claim: there will be a scientific theory of deep learning, and it’s starting to take shape right now. They call the emerging field learning mechanics.

The bridge metaphor

Start with the basic puzzle. When a civil engineer designs a bridge, the design choices go directly into the bridge. You decide on the steel, the cables, the truss pattern, and the bridge that gets built reflects those decisions. If it falls down, you can trace it back to your blueprint.

Deep learning is not like that. The engineers don’t design the model — they design the playground in which the model designs itself. You set the architecture, the data, the optimizer, the learning rate. Then you press go, gradient descent runs for a few weeks, and out pops a model. Your decisions only shaped the final product indirectly, through this messy training process. So when something breaks, it’s hard to know which knob caused it.

Civil engineers have a theory of materials. They can predict, before pouring concrete, roughly what the bridge will do. Deep learning has nothing like that yet. It has, in Jamie Simon’s wry phrase, “grad student descent” — a human-powered version of the same trial-and-error loop. Learning mechanics is the attempt to give the field actual predictive theory.

Mechanics versus interpretability

The paper draws a distinction. Mechanistic interpretability — the field that tries to identify “circuits” inside a trained network and figure out what they’re computing — is the biology of deep learning. It’s qualitative, semantic, anatomical. You poke at a neuron, you ask what it’s doing.

Learning mechanics is the physics. It’s quantitative. It cares less about what a particular neuron means and more about how the entire process of learning unfolds. How do parameters move through their high-dimensional space? What governs that motion? Which initial conditions matter, and which don’t?

“Learning is really a process of movement. Learning is changing parameters. And so it’s the model moving through some parameter space. And physics has spent centuries building up tools and ideas and thought processes on how to think about movement.”

That framing — learning as motion — is the seed of the whole project.

Why now

The honest answer is that fourteen people just finished their PhDs at roughly the same time, knew each other from neighboring institutions, and went to a cabin in the Berkshires for a week to argue about research. But there are real intellectual reasons too.

For years, deep learning practitioners treated theory as useless. The empirical results raced ahead. Every six months a new architecture or training trick would land, and the theorists would still be writing equations about what worked two years ago. But three things have shifted.

First, the field has converged. Most large models look broadly similar now — transformer-shaped, trained on next-token prediction, scaled up by familiar formulas. The chaos of the early 2010s, where someone might invent a wholly new architecture every month, has settled. There’s a stable thing to build a theory about.

Second, models have gotten huge. And huge systems, paradoxically, are easier to study. Imagine someone hands you a box with a hundred gas molecules in it and asks for a theory. You’d have to track every position and velocity — a nightmare. Now imagine they hand you a box with 10²⁰ molecules. Suddenly you can ignore the individual particles and just talk about pressure and temperature. The math becomes clean. Modern neural networks are big enough that the same trick is starting to work on them.
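The gas analogy rests on a statistical fact: an average over many independent parts fluctuates less and less as the number of parts grows. A minimal sketch of that effect follows, using uniform random numbers as a stand-in for molecular velocities. The function name, sample sizes, and seed are illustrative assumptions, not anything taken from the paper.

```python
import random
import statistics

def spread_of_average(num_particles, trials=30, seed=0):
    """Standard deviation, across trials, of the mean of `num_particles` random values."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        values = [rng.random() for _ in range(num_particles)]
        averages.append(sum(values) / num_particles)
    return statistics.stdev(averages)

small_box = spread_of_average(100)
large_box = spread_of_average(10_000)
print(f"spread with 100 particles:    {small_box:.5f}")
print(f"spread with 10,000 particles: {large_box:.5f}")
```

The larger box should report a noticeably smaller spread, which is the sense in which the bulk quantity becomes predictable even though no individual value is.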

Third, theory has actually scored a few wins. The most famous is MUP (maximal update parameterization, usually written μP or muP), a technique pioneered by Greg Yang. It tells you how to scale hyperparameters, such as the learning rate, from a small model up to a giant one without re-tuning. Before MUP, you’d find good settings on a small model, scale up, and discover everything was wrong. With MUP, the small model becomes genuinely informative about the big one. Think of it like building a scale model of the Golden Gate Bridge that actually predicts how the real bridge will behave under wind load: the trick is knowing which dimensionless quantities to preserve when you scale up. MUP is one of the first cases where a theoretical insight quietly went into shipping production AI systems.

The five pockets

The paper organizes the emerging theory into five clusters of evidence — five “pockets” where real progress is happening. The hope is that these pockets will widen and eventually link up into a comprehensive theory over the next decade.

Pocket one: Analytically solvable settings exist. You can take a deep neural network, strip out the nonlinearities (the bits that turn each layer’s output into something curvy and complicated), and you’re left with a deep linear network. As a function, a deep linear network can express nothing a shallow one can’t, because a chain of matrices multiplies out to a single matrix. But the learning dynamics are different. The deep version learns the most important directions first, then the next most important, then the next. This is a stripped-down toy, but it reveals a property that shows up in real networks too: a built-in bias toward simplicity. Networks tend to grab the strongest signal in the data first, then the next strongest. That bias toward parsimony is one reason they generalize at all.
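The ordering effect can be seen in about the smallest deep linear network imaginable: a product of two scalar weights, a * b, fitted to a single target value. The sketch below assumes an idealized setting in which each direction in the data decouples into one such scalar problem, with the target playing the role of that direction's strength. The names, step size, and initialization are illustrative choices, not the paper's.

```python
def steps_to_half(strength, lr=0.01, init=0.01, max_steps=100_000):
    """Gradient descent on 0.5 * (strength - a*b)**2.

    Returns the step at which the product a*b first reaches half of `strength`.
    """
    a = b = init
    for step in range(max_steps):
        if a * b >= 0.5 * strength:
            return step
        err = strength - a * b
        # Simultaneous update: each weight's gradient is scaled by the other weight.
        a, b = a + lr * err * b, b + lr * err * a
    return max_steps

strong = steps_to_half(3.0)
weak = steps_to_half(1.0)
print(f"strong direction (strength 3.0) half-learned after {strong} steps")
print(f"weak direction (strength 1.0) half-learned after {weak} steps")
```

Because both weights start small and each one's gradient is multiplied by the other, progress is slow at first and then takes off, and it takes off sooner when the target is larger. That is the toy version of learning the strongest signal first.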

Pocket two: Insightful limits reveal fundamental behavior. This is the most physics-flavored pocket and, both authors think, the most important. The technique: take some parameter — width, depth, learning rate, number of training steps — and push it to infinity or zero. In the limit, the messy contingent details often vanish, leaving the underlying structure exposed. Take the gradient flow limit — let the learning rate go to zero, but take infinitely many steps. What was a discrete jumpy process becomes a smooth differential equation, the kind physicists have been solving for centuries. Or take the infinite width limit — what happens as you make every layer arbitrarily wide? Two answers exist: the neural tangent kernel limit (mathematically beautiful but the network stops learning new features), and the MUP limit (harder but more realistic, preserves feature learning). The community has converged on the second as the right one to study.
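The gradient flow limit can be checked on a one-parameter example. For the loss 0.5 * curvature * w**2, the continuous equation dw/dt = -curvature * w has the solution w(t) = w0 * exp(-curvature * t). The sketch below runs ordinary gradient descent for the same total "time" (learning rate times number of steps) and measures how far it ends from that curve. The quadratic loss and all names are assumptions chosen for simplicity.

```python
import math

def gradient_descent_endpoint(lr, total_time=1.0, curvature=1.0, w0=1.0):
    """Run gradient descent on 0.5 * curvature * w**2 for total_time / lr steps."""
    w = w0
    for _ in range(round(total_time / lr)):
        w -= lr * curvature * w
    return w

# Endpoint of the continuous gradient flow: w0 * exp(-curvature * total_time).
flow_endpoint = 1.0 * math.exp(-1.0 * 1.0)

errors = {
    lr: abs(gradient_descent_endpoint(lr) - flow_endpoint)
    for lr in (0.1, 0.01, 0.001)
}
for lr, err in errors.items():
    print(f"lr={lr}: gap to gradient flow = {err:.6f}")
```

The gap shrinks as the learning rate shrinks, which is the sense in which the jumpy discrete process approaches a smooth differential equation.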

The discretization hypothesis. Buried inside the limits discussion is a beautiful, almost philosophical claim. Practical deep learning, the authors propose, is best understood as a discretization of some ideal continuous system. When engineers simulate fluid flow, they don’t solve continuous equations — they discretize space into a mesh and let the fluid flow through the mesh. The finer the mesh, the better the simulation. Deep learning, by this view, is the same thing. The “true” model is some continuous platonic object. Real networks are discretizations of it: finite width, finite depth, finite data, finite step size. Scaling up doesn’t change what the model is doing, it just makes the mesh finer. This reframes the magic of “more data and more parameters keep working” — of course they do, you’re getting a higher-resolution simulation of the same underlying thing.

“We also don’t understand exactly how the water moves around in the glass, right? But we know it’s not going to jump out, right? Because it’s just this is how it mostly works.”

Pocket three: Simple equations capture macroscopic statistics. Sometimes you can’t derive the laws from first principles, so you go the other way: you observe empirical regularities and then try to explain them. This is how chemistry developed. Boyle noticed that pressure and volume of a gas are inversely related at constant temperature. Charles noticed that volume and temperature are proportional at constant pressure. Eventually someone wrote PV = nRT. Only much later did the kinetic theory of gases explain why.

Deep learning has its own emerging laws. The most famous are the neural scaling laws: the empirical observation that loss decreases as a power law in model size, dataset size, and compute. These laws are currently driving every major training run in Silicon Valley. There’s also the edge of stability effect: as you train a network, a quantity called the sharpness (technically, the largest eigenvalue of the loss surface’s second-derivative matrix, the Hessian, which measures how strongly the loss curves in its most curved direction) grows and grows, then plateaus at a value very close to 2/learning rate. That number isn’t random. For a quadratic loss, classical optimization theory says 2/learning rate is precisely the sharpness above which simple gradient descent becomes unstable. Networks somehow learn to balance right at the edge.
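The 2/learning rate threshold can be seen directly on a one-dimensional quadratic loss, 0.5 * sharpness * w**2, where the sharpness is simply the second derivative. Each gradient descent step multiplies w by (1 - lr * sharpness), so the iterates shrink only while that factor has magnitude below one, that is, while sharpness is below 2/lr. The sketch below is a toy check of that classical fact, not a simulation of the edge of stability effect in a real network, and the names are illustrative.

```python
def final_distance(sharpness, lr, steps=200, w0=1.0):
    """Gradient descent on 0.5 * sharpness * w**2; returns |w| after `steps` updates."""
    w = w0
    for _ in range(steps):
        w -= lr * sharpness * w  # each step multiplies w by (1 - lr * sharpness)
    return abs(w)

lr = 0.1
threshold = 2 / lr
below = final_distance(0.9 * threshold, lr)  # sharpness just under 2/lr
above = final_distance(1.1 * threshold, lr)  # sharpness just over 2/lr
print(f"sharpness below threshold: |w| = {below:.3e}")
print(f"sharpness above threshold: |w| = {above:.3e}")
```

Just under the threshold the iterates oscillate but decay toward zero. Just over it they oscillate and blow up.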

Pocket four: Hyperparameters can be disentangled. A modern model has dozens of knobs — width, depth, learning rate, batch size, attention heads, weight decay, dropout. They look impossibly tangled. But the field is finding that you can often pull them apart. MUP is the headline example: it teaches you how learning rate should scale with width, so the two stop interacting. Once disentangled, each knob can be studied on its own.
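To see why learning rate and width are entangled in the first place, consider a single linear unit f = sum(w_i * x_i) trained with squared loss. One gradient step changes the output by lr * error * sum(x_i**2), and that sum grows in proportion to the width. The sketch below is a toy illustration of that interaction under assumed inputs (all ones) and a zero initialization. It is not the MUP prescription itself, whose actual rules depend on the layer type and the optimizer.

```python
def output_change_after_one_step(width, lr):
    """One gradient step on a linear unit with squared loss; returns the change in output."""
    x = [1.0] * width   # toy input whose entries are all of order one
    w = [0.0] * width   # zero init, so the output starts at 0
    target = 1.0
    before = sum(wi * xi for wi, xi in zip(w, x))
    err = target - before
    w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    after = sum(wi * xi for wi, xi in zip(w, x))
    return after - before

base_width, base_lr = 100, 0.001
for width in (100, 1_000, 10_000):
    fixed = output_change_after_one_step(width, base_lr)
    scaled = output_change_after_one_step(width, base_lr * base_width / width)
    print(f"width={width}: fixed lr -> {fixed:.3f}, lr scaled by 1/width -> {scaled:.3f}")
```

With a fixed learning rate the size of the update grows with width, so a setting tuned on the small model overshoots on the large one. Scaling the learning rate by 1/width keeps the update the same size at every width, which is the kind of decoupling the paragraph above describes.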

Pocket five: Universal phenomena appear across settings. This is the most surprising one. Train two completely different diffusion models on completely different image datasets. Hand them the same patch of random noise. As the models scale up, they start producing the same images. Different architectures, different data, same output. This suggests they’re converging to a shared underlying structure — the platonic representation hypothesis. Some researchers go further and claim that even vision models and language models, trained on totally separate data, end up with similar internal representations. If true, it means there’s a single “right” world model that any sufficiently rich learner will discover, and the model is just the discovery mechanism. The complexity, in this view, lives in the data, not the model.

The dream of a hyperparameter-free model

There’s a thread running through every pocket: take limits, simplify, strip things away, see what’s left. The dream is to keep doing this until you arrive at something with zero hyperparameters — a platonic ideal of a deep learning system. Width, depth, number of attention heads, learning rate — all gone, all sent to their natural limits. What remains? Maybe just a clean equation describing how a network ingests data and forms representations, like an ideal gas law for learning. Right now this is more of a research vibe than a concrete result, but it’s the direction the most ambitious people in the field are pointing.

What this would unlock

If the program succeeds, three things change.

Practically: less guesswork in training. Instead of running ten experiments to find a learning rate, you’d compute it. Instead of empirically discovering scaling laws, you’d predict them from first principles. The cost of building frontier models would drop, and smaller players could compete.

Scientifically: a real understanding of intelligence. Brains and artificial networks both learn from data. If we crack the math of one, we may finally have a lens for the other. Both authors hang out with neuroscientists and observe how much harder the problem is over there — you can write down beautiful theory but you can barely measure anything in a real brain. With artificial networks, every number is observable. That’s an unfair advantage.

Safety-wise: the ability to make grounded claims. Right now, when someone asks “is this AI system safe?”, the honest answer is some version of “we trained it for a while and it didn’t do anything bad in our tests.” That’s about as rigorous as saying a bridge is safe because it hasn’t fallen down yet. With theory, you could in principle make stronger statements — about what a model has learned, what its failure modes are, how its behavior will change as it scales.

“Unless you trust the AIs to police themselves you don’t want to totally hand over control, and having some kind of fundamental theory gives you a foot in the door.”

The honest bit at the end

Both authors are clear-eyed about how early all this is. The paper isn’t a textbook for a finished science — it’s a syllabus for a course nobody has taught yet. Most of the answers don’t exist. The ten “open directions” they list are genuinely open. They invite young researchers to pick a pocket, find a problem, and start laying bricks.

Their image is patient: science is a building constructed brick by brick. No single paper will solve it. Each contribution should be a small, solid, load-bearing thing that supports the next brick on top.

Key Takeaways

Learning mechanics is the proposed name for a “physics of deep learning” — a quantitative, mathematical science of how neural networks learn, distinct from mechanistic interpretability (the “biology” that names circuits and assigns meaning).
The fundamental puzzle: in deep learning, engineers design the training process but the model designs itself. Unlike civil engineering, design choices act on the final product only indirectly.
MUP (maximal update parameterization) lets you tune hyperparameters on a small model and transfer them to a large one — by preserving certain non-dimensional quantities under scaling. This is the field’s first major theory-to-practice win.
Discretization hypothesis: practical deep learning is a discretization of some ideal continuous system. Scaling up = finer mesh, not qualitative change. Explains why “just add more parameters” keeps working.
Edge of stability: during training, the loss surface’s sharpness grows until it hovers around 2/learning rate, the value above which classical gradient descent becomes unstable. Networks ride right at the edge.
Neural scaling laws are the deep learning equivalent of the gas laws — empirical regularities currently driving every frontier training run, still awaiting first-principles explanation.
Two infinite-width limits exist: the neural tangent kernel limit (clean math, no feature learning) and the MUP/feature learning limit (harder math, more realistic). The field has converged on the second as the right object of study.
Simplicity bias: even stripped-down models (linear networks, kernel methods) reveal that learning naturally picks up the strongest signals in the data first. This bias is probably why deep learning generalizes at all.
Platonic representation hypothesis: different models trained on different data may converge to the same underlying world model. Two diffusion models trained separately can produce identical images from the same noise as they scale up.
The deep insight: complexity lives in the data, not the model. The model’s job is to be a sensitive instrument that takes imprints from data — most of the action is in what data you show it.
The field’s central unsolved question is comparing representations across models — how do you tell if two networks are “thinking the same way”? In high dimensions, dissimilar things can look similar.
Five lines of evidence that a theory is emerging: analytically solvable settings, insightful limits, simple macroscopic equations, disentangled hyperparameters, and universal phenomena across architectures.

Claude’s Take

Score: 8/10. This is a good conversation about an important paper, slightly hampered by the format. Two theorists explaining their research to a sympathetic CEO is going to drift technical, and it does. But the central thesis is genuinely interesting, and the framing (learning mechanics as the physics counterpart to mechanistic interpretability’s biology) is a clean, useful distinction.

The strongest part is the discretization hypothesis. It’s the kind of reframing that, if it holds up, changes how you think about everything. The idea that scaling up isn’t producing qualitatively new behavior, just a higher-resolution simulation of some fixed underlying object, is both unsettling and beautiful. It would mean the AI scaling story is less “we keep stumbling into emergent magic” and more “we’re rendering the same scene at higher fidelity.” Whether that’s true is exactly the kind of question the program proposes to answer, which is the right kind of recursion.

The weakest part is the unavoidable one: this is a perspective paper about a field that doesn’t exist yet, by people who have a vested interest in the field existing. Some of the optimism reads as motivated. The historical analogy to chemistry is doing a lot of work — chemistry got PV=nRT and then kinetic theory, and that took about a century with a lot of dead ends. Deep learning theorists keep promising that the field is on the cusp; this paper is the most credible version of that promise, but it’s still a promise.

What I’d note is that the practical payoff so far is small. MUP is real and useful, and the edge-of-stability work is intellectually beautiful, but neither has shifted the trajectory of frontier AI development the way the authors clearly hope theory eventually will. The bridge from “we understand the dynamics of deep linear networks” to “we can make grounded safety claims about Claude” is very long. It might still get built, brick by brick, as Simon insists. But the speech about safety at the end felt like the obligatory third reason rather than the central motivation.

Worth reading the actual paper if you’re an ML person. Worth this 90-minute conversation if you’re curious about what serious theorists are actually working on, as opposed to the standard “AI is magic” or “AI is just statistics” framings that dominate everywhere else.