
Transcript

The Physics Secret Behind Neural Nets' Weirdest Phenomenon


Have you ever watched someone learn something in a way that feels almost backwards? Like a student who can ace the exact homework problems they’ve seen before, but completely freezes on a new one. Until one day it suddenly clicks, and now they can solve whole new problems they were never shown. Neural networks sometimes do an eerily similar thing. And when it happens, it looks so dramatic that it got its own name: grokking. Interestingly, this same phenomenon also happens in physics, in what physicists call a phase transition.

Grokking is not about a model becoming conscious or mystical. It’s about a very particular training story. First, the model gets great at the training set by effectively memorizing it while still doing badly on new examples. Then, much later, sometimes after training for a thousand times longer than it took to memorize, it abruptly starts doing well on new examples, too. It’s as if the model stopped relying on a pile of flash cards and finally learned the underlying rule. The surprising part is the delay. If a simple rule exists, why doesn’t the network find it right away? Why does it so often settle into a good-enough memorizing strategy first and only later switch to something more like understanding?

To make this concrete, imagine a toy task that’s famous in grokking research: modular addition. On a 12-hour clock, 9 + 5 doesn’t give 14; it wraps around and gives 2. Now replace 12 with some other number, say 17, and ask the model to learn that for any two numbers from 0 to 16, it should output their sum with wrap-around. There’s a clean rule that works for every possible pair. Here’s the twist. We don’t show the model all possible pairs. For modulus 17, there are 17 × 17 = 289 input pairs. We might show it only, say, 40% of them for training and hold the rest back as the test set. If the model merely memorizes the training pairs, it will look brilliant on the training set and mediocre on the test set.
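The toy setup above can be sketched in a few lines of Python. This is a minimal sketch: the modulus, the 40% split, and the random seed are just the arbitrary choices described above, not any particular paper's setup.

```python
import random

# Toy grokking dataset: modular addition with modulus p.
# Inputs are all pairs (a, b) with 0 <= a, b < p; the target is (a + b) % p.
def make_modular_addition_split(p=17, train_frac=0.4, seed=0):
    pairs = [(a, b) for a in range(p) for b in range(p)]  # p * p = 289 pairs for p = 17
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))  # show the model only a fraction of all pairs
    train = [((a, b), (a + b) % p) for (a, b) in pairs[:n_train]]
    test = [((a, b), (a + b) % p) for (a, b) in pairs[n_train:]]
    return train, test

train, test = make_modular_addition_split()
```

The held-back 60% is exactly the test set that exposes memorization: a lookup table over the 115 training pairs says nothing about the other 174.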
But if it learns the wrap-around rule, it will suddenly become good on almost every unseen pair, too, because the rule applies everywhere. And that’s exactly what grokking looks like in the plots. Early in training, the training performance shoots up quickly. The model seems to get it, at least on the examples it’s seen. But the test performance stays stubbornly low, often flat for a long time. Then, after a long plateau, the test performance rises rapidly. The gap between train and test collapses, and the model finally behaves like it learned the rule rather than the examples.

If you’ve trained machine learning models before, this is a little unsettling. We’re used to the idea that if a model is going to generalize, it typically starts showing that fairly early. Grokking says, “Sometimes the model spends a long time in what looks like a dead-end strategy and only later escapes it.”

Now, one important misconception to clear up. That jump can look more dramatic than it really is. If you only check test accuracy once per epoch, and accuracy itself changes in steps because each prediction is simply right or wrong, a fast improvement can look like a cliff. If you also plot test loss, which is a smoother measure of how confident the model is, you often see a steep ramp rather than an instantaneous jump.
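The difference between the two metrics is easy to see in code. A minimal sketch, assuming you have raw model logits available: accuracy only moves in steps, while cross-entropy loss moves smoothly with confidence.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def accuracy(logits, labels):
    # Stepwise metric: each prediction is either right or wrong.
    return float((logits.argmax(axis=-1) == labels).mean())

def cross_entropy(logits, labels):
    # Smooth metric: keeps improving as the model grows more confident,
    # even while accuracy sits unchanged.
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

labels = np.array([0, 0])
unconfident = np.array([[0.1, 0.0], [0.1, 0.0]])  # barely correct
confident = np.array([[5.0, 0.0], [5.0, 0.0]])    # same predictions, high confidence
```

Both sets of logits score 100% accuracy, but the confident one has a far lower loss, which is why a loss curve can reveal a gradual ramp that an accuracy curve hides.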
Still, even when you measure carefully, something real is happening. The model’s internal strategy is changing. So, what’s the strategy shift? A helpful way to think about it is that the training data usually doesn’t uniquely determine a single solution. There are many different functions a neural network could implement that all fit the training examples perfectly. Some of those functions are basically memorization, essentially a complicated lookup table that happens to give the right outputs on the training points, but behaves unpredictably everywhere else. Others correspond to a genuine rule that extends correctly to new inputs. Both fit the training set. Only one fits the world beyond it.

Training, then, is not just learning facts. It’s more like a search process that selects one solution out of a vast menu of solutions that all score well on the training set. Grokking happens when the search process first lands on a memorizing solution and later migrates to a rule solution. At this point, the physics secret starts to become useful as a set of metaphors that help us reason about why this migration can be slow and why it can suddenly speed up.

Physicists often imagine systems as moving around in an energy landscape. Picture a mountain range of valleys and basins. A ball rolling downhill will quickly fall into some valley. But which valley it ends up in depends on the shape of the terrain, how much friction there is, and whether there’s noise shaking it around. In machine learning, the height of the landscape is basically the training objective, meaning the loss, or how wrong the model is, plus whatever penalties we add. Now imagine there are two broad regions in this landscape. One corresponds to memorization strategies, including many parameter settings that fit the training examples. The other corresponds to rule strategies with fewer parameter settings that implement the clean algorithm. The crucial idea is that these regions can be very different in width.
In physics terms, width is related to entropy, which basically means how many different microscopic configurations correspond to the same macroscopic behavior. In plain language, how many different ways are there for the model to be a memorizer versus how many ways are there for it to be a rule follower? Memorization often has a huge volume of solutions. If the model has lots of adjustable knobs or weights, there are many ways to tweak them so that the model matches the training examples. In contrast, implementing the true rule might require a more coordinated internal structure with weights aligning just so, representations organizing in a particular way, so the set of rule-like solutions can be narrower, at least early in training. Here comes the first big concept. Even if a rule is conceptually simple for a human, it may not be the easiest thing for gradient descent, which is the training algorithm, to stumble into. Gradient descent is not a scientist looking for elegance. It’s a local downhill process that follows whichever direction reduces training error fastest from where it currently is. And in many tasks, the quickest downhill path is a shortcut. Memorization can reduce training loss quickly because it’s flexible. The model can patch mistakes one by one. Rule learning can require building the right internal machinery before it pays off. So, training often finds the patchwork solution first. But then why does it ever leave? This is where another physics-flavored ingredient matters, regularization, and especially weight decay. Weight decay is a simple training rule that gently punishes large weights. You can think of it as a constant pressure towards simpler, less extreme settings. It just says, “Don’t use huge, contorted parameter values if you can avoid it.” And here’s the subtlety. Even after the model has essentially fit the training set, weight decay keeps acting. 
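That steady nudge can be made concrete with a toy sketch. This is plain gradient descent with a decay term (the learning rate and decay strength below are hypothetical values, not a recommendation): even when the loss gradient is zero because the training set is already fit, the decay term keeps shrinking the weights on every step.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    # Weight decay adds a constant pull toward zero on top of the loss gradient.
    # It never switches off, so it keeps acting after the loss is minimized.
    return w - lr * (grad + weight_decay * w)

w = np.array([5.0, -3.0])
for _ in range(1000):
    # Loss gradient is zero (training set already fit), yet w still shrinks.
    w = sgd_step_with_weight_decay(w, grad=np.zeros_like(w))
```

Each step multiplies the weights by (1 − lr · weight_decay), so they decay geometrically toward zero; a solution that needs large, contorted weights to survive is slowly eroded, while an economical one is left mostly alone.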
The model can sit at near-perfect training accuracy, yet the weights are still being nudged, step after step, towards smaller norms. Meanwhile, the randomness from mini-batches, because we train on small random subsets of data each step, adds a kind of jitter. In the physics analogy, that jitter is like temperature. It shakes the system enough to explore nearby configurations.

So, during the long grokking plateau, the model isn’t doing nothing. It’s drifting. Training error is already low, so the strong downhill force is gone. What remains are weaker forces, the steady pull of regularization, and the gentle shaking of optimization noise. Over time, that drift can gradually change which kind of solution is easiest to maintain.

Now we get to the second big concept. Memorization and rule-following can trade places in terms of which is more stable under these pressures. A memorizing solution might fit the training data, but it can be fragile under weight decay. If the memorizer relies on a lot of specific sharp adjustments to nail individual cases, shrinking the weights can slowly erode those special-case hacks. A rule solution, by contrast, might be more economical. It can fit many cases with a smaller, more structured set of weights. So, as weight decay continues, the rule-based strategy can become the better deal.

In physics language, you can frame this as a kind of competition between energy and entropy, often combined into something called free energy. You don’t need the equation to get the idea. One term rewards fitting the data, meaning lower loss, another term penalizes complexity, such as large weights, and the entropy term captures how many solutions exist in a region. Early on, the model is strongly driven to reduce training loss, and it gets paid quickly by going into the big, roomy memorization region. Later, once loss is already low, the simplicity pressure starts to matter more. The balance shifts.
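Schematically, and purely as an analogy (the mapping below is an informal gloss, not a literal training objective), the trade-off can be written like the free energy of statistical physics:

```latex
% Free-energy analogy, F = E - T S, with the loose mapping used here:
%   E : "energy"      ~ training loss plus the weight-decay penalty
%   T : "temperature" ~ optimization noise (mini-batch jitter, learning rate)
%   S : "entropy"     ~ log-volume of parameter settings realizing a strategy
F_{\text{strategy}}
  \;=\;
  \underbrace{\mathcal{L}_{\text{train}} + \lambda \lVert w \rVert^2}_{E\ (\text{fit + simplicity})}
  \;-\;
  T \, \underbrace{S}_{\log(\text{solution volume})}
```

Early in training the energy term dominates and the roomy (high-entropy) memorization region wins; once the loss is low, the penalty and entropy terms set the balance instead.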
Eventually, the rule region becomes the cheaper place to live, and when that balance crosses a tipping point, the transition can be fast. That’s the essence of the phase transition analogy. In a phase transition, you change a control knob smoothly, things like temperature, pressure, a magnetic field, and the system’s behavior changes abruptly. Water suddenly freezes, a magnet suddenly aligns, a material suddenly becomes superconducting. In grokking, the control knob can be training time, or weight decay strength, or learning rate schedule, or dataset size. The order parameter, meaning the thing that shows the phase, can be test accuracy, or the generalization gap. You don’t need to believe neural nets are literally water molecules to find this useful. The phase transition lens is simply saying the training process can hover near one regime for a long time, and then a small additional change pushes it into another regime quickly. This also explains why grokking is so sensitive to training choices. If you stop early, you never see it. If weight decay is too weak, memorization can remain comfortable forever. If weight decay is too strong, the model can’t fit even the training data, so it never reaches the plateau where the slow drift can do its work. There’s often a Goldilocks zone where memorization happens first, but is eventually destabilized. It also explains why different optimizers and batch sizes change the story. Small batch sizes introduce more noise, essentially more temperature. That noise can help the model explore and escape narrow grooves, but too much can keep it bouncing around without settling. Learning rate schedules can act like cooling. If you lower the learning rate, the system explores less and freezes into whatever basin it’s in. If you cool too early, you might lock in memorization. If you cool at the right time, you might help the model settle into the rule basin once it becomes favorable. Even model size can pull in opposite directions. 
Bigger models can memorize more easily, which might extend the plateau, but they also have more capacity to represent the rule. So, under the right pressures, they might grok more cleanly. That’s why grokking isn’t a single-knob phenomenon. It’s an interaction among capacity, data, regularization, and training dynamics.

Now, a fair question: does this only happen in toy problems like modular arithmetic? It’s easiest to see in clean rule-based tasks because you can design tests that force true generalization, specifically new combinations that can’t be solved by surface similarity. In real-world images and language, the data is messy and full of overlapping cues. Improvements tend to look smoother, and the rule might not be a single crisp thing. But grokking-like behavior can still appear as islands. A model might suddenly get much better at a structured subskill like arithmetic in text, syntax-like patterns, or algorithmic transformations, even if overall performance improves gradually.

And here is another question. How would you tell the difference between memorization and genuine rule learning? The best test is this: does it work on variations that break superficial cues? If you rename symbols, permute labels, increase the size of numbers, extend sequence lengths, or change irrelevant details, does the model keep working? Rule learners tend to be stable under these changes. Memorizers tend to collapse.

This is why grokking matters beyond being a quirky plot. It’s a warning and a promise at the same time. The warning is that a model can look perfect on training data. It can look stable for a long time while still relying on a shallow strategy that won’t transfer. If you stop training when progress seems done, you might freeze in the memorization phase. Plus, if you only evaluate in a way that doesn’t expose shortcuts, you can fool yourself into thinking the model understood something it didn’t.
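As a cartoon of that probe (both “models” below are hypothetical stand-ins, not trained networks): a rule-follower keeps working on pairs outside its training set, while a pure lookup table collapses there.

```python
# Contrast a rule-follower with a lookup table on pairs neither saw in training.
p = 17
all_pairs = [(a, b) for a in range(p) for b in range(p)]
# An arbitrary ~80% "training set"; the held-out pairs break the lookup table.
train_pairs = {(a, b): (a + b) % p
               for (a, b) in all_pairs if (a + 3 * b) % 5 != 0}

def rule_model(a, b):
    return (a + b) % p  # implements the rule, so it works everywhere

def memo_model(a, b):
    return train_pairs.get((a, b), -1)  # off its "flash cards", it has nothing

def unseen_accuracy(model):
    unseen = [(a, b) for (a, b) in all_pairs if (a, b) not in train_pairs]
    return sum(model(a, b) == (a + b) % p for (a, b) in unseen) / len(unseen)
```

On the held-out pairs, `unseen_accuracy(rule_model)` is perfect while `unseen_accuracy(memo_model)` is zero, which is the train/test gap collapsing (or not) in miniature.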
The promise is that training longer under the right kind of simplicity pressure can sometimes transform the behavior from brittle to robust without changing the architecture or feeding it new data. The network can, in a sense, discover the underlying algorithm it was capable of all along. So, what’s the satisfying conceptual takeaway? The one that makes grokking feel less like a ghost in the machine. Grokking is what you see when learning is not just about fitting examples, but about choosing among many ways to fit them. Early in training, the easiest way down the hill is often a patchwork. Memorize, exploit shortcuts, grab quick wins. Later, once the training set is already conquered, the subtle forces can slowly reshape the balance of power. Eventually, a rule-based strategy becomes the stable, economical way to keep the loss low, and the model transitions into it. The jump looks sudden because you’re watching a tipping point, the way a physical system snaps into a new phase. In other words, the physics secret isn’t that neural networks obey mystical laws. It’s that the same intuitions physicists use to understand why systems get stuck, drift, and suddenly switch regimes also help us understand why a neural network might spend ages looking like it’s just memorizing right before it finally unmistakably starts to generalize. Thanks for watching. Until the next video, take good care of yourself.