There Will Be A Scientific Theory Of Deep Learning
At the heart of trying to understand artificial intelligence, or how deep learning works, is understanding learning, and learning is about movement. Learning is changing parameters, so it’s the model moving through some parameter space. And physics has spent centuries building up tools and ideas and thought processes on how to think about movement.
Hey, welcome to Generally Intelligent. I’m your host, Kanjun, and I’m the CEO of Imbue. We’re an AI company whose mission is to make tech serve humans, and we do that by building open agent infrastructure tools, agents that help humans maintain control as AI capabilities grow. Today we are talking about a new perspective paper from Jamie Simon and Daniel Kunin and their 12 other co-authors. This paper argues that there will be a scientific theory of deep learning, and it gives a vision of how this will look. This is actually a little bit controversial in the field, and we’ll get into that.
Jamie Simon is a deep learning theorist. He did his PhD in physics at Berkeley, advised by Mike DeWeese, and won the department’s best thesis award. His research aims to build a first-principles understanding of the learning behaviors of neural networks, often taking inspiration from ideas in physics. Jamie is also a research fellow at Imbue. We fund this work because we believe a scientific theory of deep learning is a great tool for the democratization of power.
Daniel Kunin is a postdoc at Berkeley, also hosted by Mike DeWeese and by Peter Bartlett. Daniel completed his PhD at Stanford, advised by Surya Ganguli. His research integrates insights from statistics, physics, and neuroscience to study the mathematical principles of artificial and natural intelligence.
We also have here Josh Albrecht, Imbue’s co-founder and CTO. Josh is a former machine learning researcher who recently has been focused on building open source tools for agents. Josh runs about 100 agents in parallel right now and has been shipping like 50,000 lines of code every day. Sorry, a week, [laughter] not a day. We’re not there yet. Maybe in a year or two. Which is kind of crazy, but really interesting, and he is kind of bringing the applied lens to deep learning. Let’s get started.
You guys are publishing a paper, and you make the case in this paper that there will in fact be a theory of deep learning. You talk about the emerging evidence for such a field, and you’re calling the field learning mechanics. Can you tell me what learning mechanics is? How would you describe it?
So I would say learning mechanics is this term that we’re proposing for essentially a fundamental, mechanistic, mathematical science of the learning and other behaviors of neural networks. If you’ve heard of mechanistic interpretability, it often frames itself as the biology of deep learning, right? It approaches things with a sort of biologist’s or systems neuroscientist’s lens, trying to pick apart anatomy, identify circuits, and connect these with a semantic interpretation of what these models are doing and how they’re thinking. It’s mostly a qualitative science. Learning mechanics hopes to be essentially the physics of deep learning: a first-principles, highly mathematically grounded counterpart, more distant from the semantics and closer to things like the training process, the selection of hyperparameters, and the dynamics of how all this learning happens. And a theory like this gives a solid foundation to ask lots of other questions in the world of deep learning that you might want to know the answer to.
Something I’m curious about is why does deep learning need such a theoretical foundation? What does the “physics” of deep learning give us that the biology doesn’t?
Well, having a theory of deep learning would have many practical, scientific, and safety implications. From the practical point of view, deep learning has historically been driven mostly by trial and error, seeing what works and what doesn’t work, and kind of doing gradient descent in terms of practice.
And having an actual theory for the foundations of what’s driving success in these tools would allow us to be more theory-driven, more efficient, and, along that line, potentially more safe. We could think about the risks that come with these technologies and design ways to mitigate those risks. In terms of the scientific reasons, I think, one, it’s just a fascinating idea to try to build an understanding of these very complex machines that are generating text and images, and to actually try to understand what’s really driving that success. And then there’s the more neuroscience side of things, which is that if we really understand how artificial intelligence is working, that might also provide a lens for understanding natural intelligence. So building a scientific theory of deep learning would have all these different practical, safety, and scientific implications.
So one thought is, okay, if we have a physics of deep learning instead of just mechanistic interpretability, we have less grad student descent and more like engineering for deep learning systems. So less guess and check, and more causal, predictive models for how we should expect a training run to pan out based on different dynamics.
Yeah. Deep learning, I mean machine learning in general, is a really interesting technology because our choices don’t directly go into the final product, right? When we design these systems, we’re designing kind of a playground that iterates on itself, and the result is a model. So our design choices as engineers are only implicitly affecting the final product, which makes it very difficult to know how that product will behave. When we design a bridge, we’re designing the bridge itself, and so we can understand how our design choices affect whether the bridge is going to fall or not.
And of course having a theory of civil engineering gives us the ability to understand the conditions under which bridges might fall or not. That’s very difficult when our design choices are not going into the final product. So this idea of building a learning mechanics is in some sense trying to understand how those design choices at the beginning, in the setup, whether the data, the architecture, the learning rule, or the hyperparameters of that whole process, affect that final product.
It’s kind of like, how do the initial conditions pan out?
Yeah. Yeah. And to the question of why we want a sort of fundamental physics in addition to a kind of biology or systems neuroscience perspective, there are things that you can do with a quantitative science that are much harder to do with a qualitative science. And I mean, the opposite is also true. There are things you can do with a semantically aware science that you cannot do with a sort of dumb science. Explaining cognition in the brain from quantum physics alone is very difficult. But also, explaining why neurons fire without an understanding of atomic physics is impossible. So if we really want to understand this, if we’re serious about the challenge of understanding deep learning and putting together some kind of publicly available theory about it, then we do want to be studying it at all levels of abstraction. And this level happens to resemble physics.
There’s no dogmatic reason why there should be a physics of deep learning, any more than there should be a pharmacology of deep learning or whatever. It’s descriptive rather than prescriptive. In trying to understand these things and asking natural questions, we were finding, and people in our field have found, that asking certain types of natural questions just leads you inexorably towards this sort of first-principles quantitative problem that’s at the root of everything. And we think that problem has an answer that’s going to look kind of like a physics in some ways and not in others. That’s what we’re trying to articulate here.
Yeah. To me, just kind of piggybacking on this, learning is really a process of movement, right? Learning is changing parameters. And so it’s the model moving through some parameter space. And physics has spent centuries building up tools and ideas and thought processes on how to think about movement. That’s in physical space. So at the heart of trying to understand artificial intelligence, or how deep learning works, is understanding learning, and learning is about movement.
That’s super interesting. I love that way of thinking about it. And also, one thing that’s nice about deep learning systems is that they’re completely measurable, unlike biological systems.
Oh my gosh, [laughter] you can learn so much. So I did my PhD in the Redwood Center at Berkeley. Dan is a postdoc there and I’m a visiting scholar there. We hang out with neuroscientists all the time and get to see firsthand how difficult the task of doing theoretical neuroscience is, because you can write down all the mathematical models you want, but you’re so limited in what you can measure about the brain.
I’m curious, why now?
What were you noticing that led the 14 of you co-authors to get together and write this paper now? Not 5 years ago, not 10 years from now. What are you observing?
Well, there’s a simple answer to that, which is that most of us just graduated from our PhDs. But I don’t think that we actually got together with the purpose of writing this paper. We got together to share ideas and to talk about research. All of us think deeply about trying to understand deep learning, and we take different approaches. Many of us have either a physics background or a physics perspective, but that’s not true for everyone on the list. And so really Jamie kind of pulled us together as a group. We had all known each other, either because we were at the same institutions or neighboring institutions (Jamie was at Berkeley, I was at Stanford) or because we were going to the same conferences. At some point Jamie approached me and said, why don’t we create a community among these grad students who are studying the same ideas with similar perspectives, but in different institutions and in different labs, and try to come together to think more deeply about what we’re doing and to share our ideas not just through arXiv papers but through conversation and dialogue. So this paper really came out of a retreat that Jamie organized, where we all came together in the woods in the Berkshires, cooking food for a week and sharing our research ideas. At some point we realized we all had quite different perspectives on how we do research. We might not have been coming together to come up with a new technical contribution to the field, but we realized that as a field we really needed to summarize the results and progress that we’ve made, and our internal intuitions about open directions and next steps. And so this paper kind of came out of that.
Why now more broadly in the field? I’ll let Jamie answer that.
Yeah. So there are a few reasons why the present moment is particularly promising for the development of a scientific theory of deep learning. One of the reasons is that deep learning has never been more accessible, more commoditized, or more highly studied. We now have a pretty good convergence on methods for training large-scale systems that are fairly reproducible and work fairly well. It was harder to do science on large models when things were still being hashed out, but now that landscape has kind of solidified. And most of us are in this very fortunate spot in our careers that offers us time to do it. That’s very exciting.
You’re free. Yeah. To [laughter] work on the important problems.
Yeah. Yeah. And I think that’s a key word, important, right there. We’re in a pretty unique time right now where the practice of these technologies is skyrocketing, and we can see all the effects that they’re having in our world, both here in San Francisco and the Bay and more broadly. So I think the importance of really trying to understand these things is growing.
Yeah. Totally. And also, in our field, we’ve been watching deep learning theory and the academic efforts to understand these systems push forward and grow and change and hit walls for five, six, seven years. We’ve seen a lot of things that have worked. And our assessment, we found on getting together, was: oh yeah, we think there are serious ways in which this effort, which has for a long time been embattled, where for a long time it has been really difficult to get purchase on the problem, is actually starting to work. Things are starting to come together in different ways.
So in this paper we’re trying to convey this tone of optimism as we organize different lines of evidence that things are actually coming together. In everything we do in this paper, we’re trying to be descriptive, not prescriptive. We didn’t set out to write an optimistic paper. We talked about what we thought was happening and where we should go next, and found that the natural thing to do was to articulate these emerging ideas in one place and to try to drum up momentum and excitement for making them happen.
Yeah. Maybe we can get into some of the details of why it is that you both believe that it’s now possible to make this type of theory. For a while there has been a sort of reputation among practitioners, in the past say 5 or 10 years of people doing deep learning, that theory is kind of useless or pointless, right? Our empirical results really are a lot further ahead than our theoretical understanding, at least of neural networks. And this was not always true in the past. We had a pretty good understanding of how random decision forests or support vector machines or some of these other systems work. We had at least a somewhat better understanding and characterization of them. But with deep learning there was this bigger, and it almost felt like growing, gap, where we’re sort of leaving theory behind. What is it that has changed in the past year or two? What kind of new tools are there? What kind of things are there that give you this optimism that we actually might be able to make really interesting progress on the theory of deep learning?
Yeah, totally. Let’s widen for a moment the idea of what theory is. Inside these large labs, there are massive science-of-scaling teams, right?
They want to ask how every hyperparameter scales with every other hyperparameter. Getting these things right is really important to scaling up a large model. The company that does this better could have the better model after the next training run. And identifying these empirical hyperparameter scaling relationships is sort of like a proto-version of theory. It’s like, oh, okay, there’s an exponent on this log plot. It appears to be two, to within some error bars. And great, that’s useful, right? When you get a clear signal like that, you can carve off the problem and toss it to your friendly neighborhood theorist, and there’s often a nice reason why it’s two or whatever it is. As large-scale model training becomes more and more systematized, getting things right fairly early on, instead of just trying every possible permutation of architectures and hyperparameters, has become more and more important. So it has sort of pushed this science-of-scaling perspective, which has made it clear that, oh, there are principles here. There’s a famous example, particularly celebrated in the theory community, called μP, the maximal update parameterization. μP leads to a technique called μTransfer for taking certain hyperparameters, like the learning rate for example, of a smaller model and scaling them up to a larger model. This had breakthrough success, was used a lot around the GPT-4, GPT-5 era, and in different forms is now kind of just baked into the way people think about scaling up models. That paper actually came out right when I joined Imbue, and you guys had me present it at a journal club.
Mhm. I do remember that. Yeah. Greg’s paper. Yeah.
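To make the “exponent on a log plot” remark above concrete, here is a minimal sketch of how such an exponent might be read off: a least-squares line fit in log-log space. The function name `fit_power_law_exponent` and the data are hypothetical and synthetic (generated from an exact quadratic), not measurements from any lab or from the paper.

```python
import math

def fit_power_law_exponent(xs, ys):
    """Slope of the least-squares line through (log x, log y).

    If y is roughly c * x**p, the slope estimates the exponent p.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    return cov / var

# Synthetic illustration only: a quantity that scales exactly as width**2.
widths = [64, 128, 256, 512]
measurements = [3.0 * w ** 2 for w in widths]
exponent = fit_power_law_exponent(widths, measurements)
print(f"fitted exponent: {exponent:.3f}")
```

With real, noisy measurements the fitted slope would only be close to an integer, which is the “to within some error bars” caveat in the conversation.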
Can you give a very brief description of that paper, from a technical perspective?
Yeah. Yeah. Absolutely. So here’s something that every practitioner training large models has had happen to them. You’re training a large model, but it’s really expensive. It takes hours or days or weeks to train the whole thing. But there are some numbers you have to set. These numbers are called hyperparameters, and they’re things like the learning rate, the depth, the width, the number of heads in your transformer, all of these things. I could go on for an hour with all the hyperparameters. You go and train the large model and you find, oh, that did not work as well as I expected, and you think, oh, I got the hyperparameters wrong. So what do you do? It’s too expensive to just try different combinations on the big model. So you try it on a small model, you find some good hyperparameters, you go to the big model, and they still don’t work very well, right? μTransfer is essentially a technique that identifies certain non-dimensional quantities that it prescribes should remain the same upon scaling up, in particular, the width of a model, the hidden dimension. It ends up looking like you scale your learning rate with a certain exponent relative to your width, and ditto with the initialization scale of each layer. By doing so you preserve non-dimensional quantities related to training and feature learning, and this often happens to preserve good performance from smaller models to bigger models. So this can really reduce your overhead and hassle when it comes to doing the big model.
Yeah. So practically it means that you can spend your time optimizing your hyperparameters on small models, find those optimal conditions, and then transfer them to large models.
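The width-scaling idea described above can be sketched as a simple power-law rescaling of hyperparameters tuned on a small model. This is only an illustration: the helper name `rescale_for_width` is hypothetical, and the default exponents (learning rate proportional to 1/width, initialization scale proportional to 1/sqrt(width)) are assumptions chosen for the example. The conversation only says that each quantity scales “with a certain exponent”; the appropriate exponents depend on the layer type and optimizer.

```python
def rescale_for_width(base_lr, base_init_std, base_width, width,
                      lr_exponent=-1.0, init_exponent=-0.5):
    """Rescale hyperparameters tuned at base_width for a model of a new width.

    Each quantity is multiplied by (width / base_width) ** exponent.
    The default exponents are illustrative assumptions, not a prescription.
    """
    ratio = width / base_width
    return {
        "lr": base_lr * ratio ** lr_exponent,
        "init_std": base_init_std * ratio ** init_exponent,
    }

# Tune on a narrow model, then transfer to one that is 4x wider.
small = {"lr": 1e-3, "init_std": 0.02, "width": 256}
large = rescale_for_width(small["lr"], small["init_std"], small["width"], width=1024)
print(large)
```

The point of the sketch is the workflow rather than the numbers: search on the cheap model, then apply a fixed rule, instead of searching again on the expensive one.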
An analogy that I think is particularly apt is building a small model of a bridge before building the big one, right? When you go and build the Golden Gate Bridge, it’s more expensive to build the big model.
Sure. Yeah. You want to iterate on the small one.
I’m confident that for any major bridge construction project, there are many competing designs that must be chosen between, right? But you can’t just go build the Golden Gate Bridge like 10 times and see which one stands up, which one bears the most weight, right?
Unless you raise enough money.
Unless you raise enough money. So one way you could do this is you build a small model of the bridge and you see, okay, which of these works the best. Well, hang on a minute. It’s not that easy, because how do you know that the small model is at all informative? As you scale something up, material properties change. Just look at how ants can support so much more than their weight, but we can’t, right? So how do you make a small model that’s informative for the big model? Well, you can do this if you understand some things about materials science and the scaling relationships of different stress, strain, and fracture quantities as you make a model bigger, right? What this μTransfer idea is essentially doing is identifying the right non-dimensionalized quantities to look at closely in your small model so that they are informative of your large model.
So μP and μTransfer are a good example of how theory applies to practice and helps us train larger models more effectively and more cheaply, and be able to run small-scale experiments to do that.
Yeah.
I kind of want to dive into the paper itself and talk through some more examples like this.
So you guys state five observations that serve as evidence that a theory is emerging. They are: one, that analytically solvable settings exist. Two, that insightful limits actually do reveal fundamental behavior. Three, that simple equations can capture meaningful macroscopic statistics. Four, that hyperparameters can be disentangled and actually understood, so it’s not just all a big mess. And five, that even across settings and tasks where the training setup is different, you end up seeing universal phenomena.
Mhm.
And I kind of want to dive into each of these, maybe one by one, but before we do, one thing I’m curious about is how you got to these five lines of observations. What is it about these five groups of observations that made you include them in this paper? Were there other observations that you didn’t include? Are these five saying something in particular? Are they comprehensive or not?
You know, I can’t completely remember how we settled on those five, but I do remember the ordering of those five went through quite a few iterations. In terms of whether there are other lines of evidence that there will be a scientific theory of deep learning, I think yes. This is not a comprehensive review of all papers in deep learning theory and all different approaches. There are researchers who take other approaches that we are not reviewing in this work. These are the approaches that we focused on as being the ones we thought were the most promising examples, or evidence, that there will be this theory.
And they all do have this flavor to them that is somewhat physics in spirit, or physics-inspired, right? Simple models, macroscopic variables, taking limits, disentangling hyperparameters, and a kind of universality: these are all in the spirit of mechanics. So I think this is partly why we settled on these five. I don’t know if you remember.
Yeah, so my memory of this is that we were at this cabin in the Berkshires. It was like day three, most of these 14 authors were there, and we were trying to articulate what our shared vision was, what was in common between our perceptions of what really mattered and where the field would and should go. We just started filling up tons and tons of big whiteboards with different clusters of ideas, and we changed many times in how we thought about these. At first they were like guidelines for a young person getting started in the field to follow: oh hey, you should look for toy models. Oh hey, you should take limits when you can. Oh hey, make sure you study your hyperparameters. And then we realized, after a few iterations, that there is a stronger framing: each of these isn’t just a recommendation. It actually carries with it some clear successes from the last 10 years in deep learning theory, and it is forward-looking in that it suggests important open directions and ways to think about research going forward. And all together they have this sort of mechanics flavor, this flavor of a classical or a statistical mechanics, where learning is about movement and you study the interaction of these components and how they all lead to learning.
You’re studying the mechanics of that movement.
Yeah.
Interesting. Go ahead.
Well, if we’re saying there will be a theory of deep learning, we might take that point and ask, what is actually the barrier to a theory of deep learning? That maybe is actually a more appropriate question than just saying we believe there will be one. It’s like, well, why won’t there be one? We have access to the learning rule. We have access to the data. We know exactly what the architecture is. We know the task. And as we said, we can measure everything and anything. So it’s actually kind of surprising that we don’t have a theory.
Yeah. The question is, why haven’t we figured it out already?
Right.
So how would you answer that?
It’s not the opacity of the problem. It’s the complexity. It’s an extremely complex, interacting, nonlinear, high-dimensional system, and we’re trying to understand what’s going on. So one way to think about all of these reasons that we put forward as evidence that there will be a theory is that they’re all ways of handling that complexity, of taking that complexity and simplifying it. I think that’s another way to view these five different categories. They’re success stories. They’re clusters of research papers and research approaches. And at the end of the day each takes a different way of handling that complexity, which is the barrier to a scientific theory.
That makes a lot of sense.
Yeah. And I think, to this question of why there isn’t a theory already:
the sort of questions that were asked in decades past hadn’t yet taken on board that the training process of deep learning is this high-dimensional, complex, messy thing that depends on real-world data statistics, that there’s no way around that, and that this really needs to be grappled with, right? The classic theory that people still think of is often statistical learning theory. The idea is that a simple, parsimonious model that has high regularization doesn’t overfit and will generalize well. This is a really beautiful, mathematically correct, self-contained theory, and it does point to important open directions in the modern paradigm. But at the time when this was developed, we didn’t have modern deep learning. There’s no way that this understanding, that the complexity can’t be reduced and there’s no easy way out, could have been worked into the DNA of that whole way of thinking. So I think that the more modern flavor of deep learning theory is less like this sort of mathematics that tries to just work out everything and offer a nice simple guarantee to practitioners, and is actually much more like a diverse, rich, very scientific approach that just dives into the complexity and tries to make sense of the pieces of it, as opposed to trying to simplify it down.
Yeah.
Yeah. And a nice thing about that approach is that we don’t need one simple answer to be making progress, right?
These generalization bounds from times past, which are essentially formalizing Occam’s razor, saying that a simple model with high regularization won’t overfit, are kind of an end-to-end theory, and I think we won’t have that for deep learning for a while. But that’s fine, because diving into this whole messy process and finding bits of it that we can organize, finding structure and regularity, there’s so much wonderful stuff to do. And even organizing pockets of it is useful for people who are working in those pockets, right?
So these are five pockets of organization of complexity.
Yes. And our hope and belief is that these pockets can be widened and eventually linked together into something resembling a comprehensive theory, and that this will happen in the next decade or so.
Cool. That’s super interesting. Let’s talk about some of the pockets.
Yeah. Yeah. Should we start with the first one?
Yeah. The first organized pocket is that analytically solvable settings exist. Can you tell me, as a theory layperson, what does that mean, and what does it let me understand or organize?
Sure. Yeah. In fact, I think my research probably overlaps most with this pocket, and certainly my pedagogy does as well. Let’s talk first about what deep learning is. It’s an architecture, a data set, a task, and a learning rule. The idea is that sometimes we can choose simple versions of these, a simple architecture, maybe a simple learning rule, a fixed data set, and we can see how all those pieces interact. In fact, we can get analytically solvable settings. So in this discussion, in Section 2.1 of the paper, we talk about two different ideas for finding these simple, analytically solvable settings.
They’re both, at the end of the day, about linearizing the problem: linearizing it in terms of the data or in terms of the parameters. Linearizing in terms of the data is saying, hey, deep learning is a sequence of linear transformations each followed by a nonlinear transformation. If we just got rid of those nonlinear transformations and thought about training a model that’s just a sequence of linear transformations from input to output, what would that look like? What would the training of a deep linear network look like? Would it be the same as training a shallow linear network? This setting has seen immense progress in the last 10 to 15 years. First off, the answer to that question is that they’re different: the deep linear network’s dynamics, and the solution that it learns, are not the same as a shallow linear network’s. So the depth really does have an effect on the learning process, and we can understand that effect directly. The effect is, generally, that there’s a preference in how a deep linear network learns the task. It breaks the task down into, let’s say, principal directions, the singular vectors of the task, and it has a preference in the order in which it learns those singular vectors. And that idea of a simplicity bias, of learning some things before other things, is a hallmark of more modern, more realistic deep learning as well. So this is one example.
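The ordering effect described above can be illustrated with a deliberately tiny simulation. This is a sketch under strong assumptions, not the paper’s analysis: the task is assumed to be already decoupled into independent singular modes, so each mode of a two-layer linear network reduces to a product of two scalars `a * b` fit to a target strength `s`, while a shallow linear model is a single scalar `w`. The function names, step size, and initialization are illustrative choices.

```python
def steps_to_half_deep(s, lr=0.01, init=0.01, max_steps=100_000):
    """Gradient descent on 0.5 * (s - a*b)**2 from a small initialization.

    Returns the first step at which the product a*b reaches half of s.
    """
    a = b = init
    for step in range(1, max_steps + 1):
        err = s - a * b
        a, b = a + lr * b * err, b + lr * a * err  # simultaneous update
        if a * b >= 0.5 * s:
            return step
    return None

def steps_to_half_shallow(s, lr=0.01, max_steps=100_000):
    """Gradient descent on 0.5 * (s - w)**2 starting from w = 0."""
    w = 0.0
    for step in range(1, max_steps + 1):
        w += lr * (s - w)
        if w >= 0.5 * s:
            return step
    return None

strong, weak = 3.0, 1.0  # two singular values of a hypothetical task
deep_strong, deep_weak = steps_to_half_deep(strong), steps_to_half_deep(weak)
shallow_strong, shallow_weak = steps_to_half_shallow(strong), steps_to_half_shallow(weak)
print("deep:   ", deep_strong, deep_weak)
print("shallow:", shallow_strong, shallow_weak)
```

In this toy setting the two-layer model picks up the stronger mode well before the weaker one, while the shallow model makes the same relative progress on both, which is the sense in which depth changes the learning dynamics even without nonlinearities.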
I think that’s a really good example, because people normally think, oh well, you simplified the problem and you simplified it too much. If you remove the nonlinearities, in theory a deep fully linear network and a shallow fully linear network are mathematically kind of the same, right? There’s nothing actually interesting happening in that, but the learning mechanics make them slightly different, right? That’s right. And so I think this is interesting because you’re saying that by splitting it up you can actually see where some of the parts of deep learning are coming from, this idea of learning the simple things first and then learning other, more complicated things afterwards. So even though you simplified it by taking out the nonlinearities and effectively making it almost trivial, just learning this one matrix or something, right? You’re still actually getting to learn something interesting about learning and about the setup by simplifying it, and that maybe helps us build towards a fuller picture even if we’ve made kind of unrealistic assumptions here. Yeah. So just so I understand, linearization means you remove the biases, it’s just weights? The nonlinearities. Yeah. So linearizing a network in terms of the data would be that we take the nonlinear network and we remove the nonlinearities at every layer. So for example, let’s say you use an architecture with ReLU. We’re just going to get rid of the ReLU and use linear activations. So activation comes in, activation goes out. No change to the activation, or rather the pre-activation and the activation are the same. That makes sense. What do you learn from these linearized settings that is applicable to nonlinear settings? And what do you lose from the linearized setting that’s no longer applicable to the nonlinear setting?
So, I mean, practically, just to be clear, we’re definitely not suggesting you train linear networks, right? So you lose [laughter] the fact that the network is useful. [laughter] That’s right. It’s only usable for studying the dynamics. Exactly. And studying this idea of how the initial conditions affect the way in which the model changes in function space. So what you gain is analytic tractability. It becomes a much simpler problem to study, and we can think about how those four ingredients all interact. And in particular I think it points to this idea that the learning process is inherently biased towards simplicity. And that is something that I believe is going to be a critical part. In fact, I think that idea comes up in the other sections of our paper. Yeah, in almost every section. And it’s going to be an essential idea behind why deep learning works so well: it is biased towards finding really meaningful aspects of the data, meaningful signals in the data, before less meaningful ones. And by doing this, it’s biased towards a simple or parsimonious solution. Even though it might look more complex, it’s actually biased towards simplicity, which helps explain some of the generalization, etc. Right? Because you’re starting with simpler things, and so then maybe it can generalize better, to some of the other earlier results, etc. One small tangent: one of the things that I’m not sure quite comes through in the paper is this idea that they learn simpler things first, and I think that both of you have ideas about what shape this sort of grand theory could take.
And at least to me, this paper isn’t necessarily just saying, oh, we sort of hope there will be a thing. Rather, it seems like you each have ideas about what this actual grand theory will concretely look like, and intuitions about how these things will link up. And they come from these types of things, right? Like, oh, analytically we can see this, and therefore we know it’s learning simpler things first. There are a bunch of other lessons like that, and if you read through the paper, you can kind of piece them together a little bit yourself. But I think one thing that I just want to highlight for people listening is that you should really skip to the end also [laughter] and check out the future directions, and reach out as well if you’re interested in that. There’s a lot going on here. There’s a lot of different evidence, and how it all fits together is kind of interesting. It’s still emerging, still early. But yeah, I’m also really curious about the “simple equations capturing meaningful macroscopic statistics” section. Can one of you explain, you know, you talk about neural scaling laws, etc., what is the shape of this bubble? Yeah. Great. So this is section 2.3 in our paper. It’s about simple macroscopic laws. So if you look back at how other sciences developed in paradigm-setting eras, often there were a whole bunch of disconnected empirical observations that people found that took the form of nice mathematical relationships, or so-called laws,
that were only later explained or linked together, all this stuff, right? So there are a number of examples like this, including neural scaling laws, which are an empirical law that is currently driving everything that’s happening in Silicon Valley. There’s something called the edge of stability effect that I think is a really beautiful thing, first found as an empirical law. Basically, as you train a large neural network and you move over the loss surface, sometimes you’re in a spot where the loss surface is very smooth, and sometimes you’re in a spot where the loss surface is very steep, you’re in a narrow valley. And you can measure a quantity called the sharpness. Technically this is the maximum eigenvalue of the second derivative matrix of the loss surface; it’s just this multivariable calculus quantity. And if you look at how this evolves over time, you find, quite reproducibly on large models, that as you take training steps it grows, grows, grows, grows, but then it levels off at a particular value that seems empirically to be two over the learning rate. And mind you, this only shows up so nicely when you’re doing full-batch gradient descent, so not stochastic. But still, that’s probably not a coincidence. Yeah. What’s going on? [laughter] Yeah. So it’s not a coincidence. The scientist probably most associated with this progressive sharpening, edge of stability idea is named Jeremy Cohen. He’s one of the authors on this paper.
And the observation that he makes in the first paper about progressive sharpening is, well, hang on, two over the learning rate is actually exactly the value of the sharpness, the Hessian curvature, at which classical learning theory, the traditional branch of optimization work that mostly took place in the last century, predicts an instability to start. So if you have a really simple function, let’s say a valley that’s too steep for your learning rate, then you’ll bounce out of it, right? Well, what’s that critical steepness? It’s two over the learning rate. Interesting. But of course all the nice closed-form analysis you could do with this assumes some nice structure on the loss surface, right? Where you could actually solve the dynamics. Indeed, this is extremely reproducible in models large and small. And it also has big implications for your choice of learning rate and the effect that that choice has on your training run and the later generalization of your model and all of this stuff. As a practitioner, how would I use this to choose a learning rate? So it’s a difficult question to answer because so many other things go into the choice of learning rate, right? I still think that, I mean, I think probably every... Let me ask a different question, actually. As a practitioner, how should I intuitively think about progressive sharpening, and also why does sharpness increase, intuitively? Yeah. So this is actually something that my collaborators and I are trying to build a good theory for right now.
Our intuition is kind of twofold. One is that the weight matrices of a neural network tend to align with each other in a linear algebraic way and also grow in norm over the course of training, and both of these things tend to lead to higher sharpness. So that’s a qualitative explanation yet to be built into a quantitative theory. And then I think there’s actually a sort of unifying geometric reason for this too. Gradient descent moves in the direction of steepest descent by definition. And if there’s a direction in which it can really fall off and speed up really, really fast, it sort of gets sucked towards those, trying to find the steepest. Yeah. So there’s a sense in which I think neural networks kind of learn to learn faster, and that probably is explaining progressive sharpening. But I should add a disclaimer that these are my personal hunches. Gotcha. Interesting. Yeah. And one of the open directions we talk about in the paper is really trying to understand the connection between sharpness, the edge of stability, and generalization. So, I mean, this is an empirically observed and repeatable experiment, and it’s very interesting. We have a theory for why the gradient descent dynamics can sit stably at this edge of stability. But actually connecting that directly to performance, generalization, and feature learning is an open direction and a really interesting direction to pursue. Interesting. So this bucket of things, simple equations capturing these macroscopic behaviors, they’re often empirically observed laws, and now this is a bucket of things where you can start to develop theory around them because they’re empirically observed and so consistent. Yeah. Wonderfully. Yeah.
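The two-over-the-learning-rate threshold from the classical analysis mentioned above can be checked on the simplest possible loss surface. This is a minimal sketch of that textbook quadratic case only, not of the edge of stability phenomenon in real networks: it assumes a loss of the form 0.5 * x^T H x with a fixed matrix H, and the two example matrices, the learning rate, and the function names are illustrative choices.

```python
import numpy as np

def sharpness(hessian):
    """Largest eigenvalue of a symmetric second-derivative matrix."""
    return np.linalg.eigvalsh(hessian).max()

def gd_final_distance(hessian, lr, steps=200):
    """Run gradient descent on 0.5 * x^T H x and return the final distance from the minimum."""
    x = np.ones(hessian.shape[0])
    for _ in range(steps):
        x = x - lr * hessian @ x          # the gradient of the quadratic is H x
    return np.linalg.norm(x)

lr = 0.1
threshold = 2.0 / lr                      # critical sharpness for this learning rate
gentle = np.diag([19.0, 1.0])             # sharpness just below the threshold
steep = np.diag([21.0, 1.0])              # sharpness just above the threshold

for name, H in [("gentle", gentle), ("steep", steep)]:
    print(name, "sharpness:", sharpness(H), "threshold:", threshold,
          "final distance:", gd_final_distance(H, lr))
```

Along the sharpest direction each step multiplies the coordinate by (1 - lr * sharpness), so the iterate shrinks when the sharpness is below 2 / lr and bounces out of the valley when it is above.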
So if you look at the history of the development of chemistry or something, you have all of these gas laws. Like, oh, pressure and temperature appear to be proportionally related, holding volume constant. Oh, well, pressure and volume are inversely related, holding temperature constant. Well, why? And you can combine these into PV equals nRT, and thinking about these kinds of things can let you sort of guess, oh, gases are made of discrete molecules that bounce around in a kinetic fashion, and that leads to a statistical mechanics view of the system. You can imagine it would have been much harder, and it would have involved making many more correct guesses, to try to go the other way. Rather than going top down, saying, okay, here’s a new empirical law, let me explain it, you would start instead by saying, okay, let us first write down a fundamental theory of gases at the microscopic level and let us make predictions at the macro level. Ah, I posit that there should be this quantity called pressure that [laughter] scales intrinsically in this certain way, and therefore... right? I mean, this is what people are trying to do with string theory right now, and it’s really hard, right? [laughter] It’s hard to just guess the answer and make predictions without some kind of, you know, linking it up. Yeah. Behavior. Yeah. It’s interesting that we talked about the first bucket and the third bucket, because they’re really the opposite, or the inverse. Yeah. The first bucket is all bottom up, building up from foundational principles and simple settings to try to understand phenomena, and the other one is empirically top down. And deep understanding of these things from the top-down approach would have huge practical implications.
Think about neural scaling laws and being able to a priori understand the exponent. What about the data, optimizer, architecture, etc. leads to that exponent in the scaling law? Being able to predict those exponents before you actually find them would be a huge win for developing more powerful systems. Yeah, maybe we can dig into the second one, “insightful limits reveal fundamental behavior.” I guess I’ll let you describe it. It’s not the top-down or the bottom-up version, so how exactly do you think about using limits to understand what’s going on? I’m going to let Jamie talk about this, but I will put first that I think actually maybe this is the most important of all of the directions. That’s why I wanted to dig into this one as opposed to skipping over it. Yeah. And maybe the most physics in spirit. Why is it the most important? I think it’s where we’ve made the most precise statements about realistic systems: when taking limits and thinking about these objects in the most high-dimensional sense. And so I think in that sense it’s the most important, because it really has flavors of a lot of the other pieces. For example, 2.1, the analytically solvable settings: by taking limits we end up actually getting analytical tractability. But it’s also related to section three in the sense that we’re actually talking about realistic systems. We’re no longer talking about toy systems.
So this intersection of realism and tractability by taking limits has been a major success in the last decade in deep learning theory, or less than a decade, honestly. And this really probably starts, I would say, with the results on neural tangent kernels by Arthur Jacot, one of the authors on this paper. Actually, maybe it starts a little before this. But yeah, tell us what taking limits on a system means, and then we’ll get into the details. Yeah, great. So to start off, I mean limits truly in the ordinary calculus sense of the term, where you have some number describing your system. It usually is a size parameter, or it could be a learning rate or some other parameter, and you take it to either infinity or to zero. And taking a limit of a variable removes it from the expression it’s in. So in calculus, if I say consider the function 5 plus 1/x as x goes to infinity, well, clearly the 1/x vanishes at a rate of 1/x. So we could talk about the asymptotics and all this stuff, but taking the limit gives you the coarse picture, which is: if x is big enough, you’re left with the five. This is one of the foundational tools, maybe the foundational tool, of statistical physics. So going back to the gas analogy, right? Imagine I hand you a little box, and it’s got in it 100 particles of gas, and they’re bouncing around like dink dink dink dink dink, and I’m like, all right, Kanjun, give me a theory of this, right? And you’re like, how? And I’m like, well, you can ask me any questions you want about the quantum mechanics of these particles, about their interactions, whatever you want to know about these molecules. And you’d say okay.
And then you’d go to the whiteboard and start thinking about how 100 position and velocity variables interact with each other, right? And you can see it’s complicated. You’re tracking a lot of stuff, right? But then there’s this thing that’s paradoxical and amazing, which is that, actually, as I add more and more particles, in some sense the problem becomes easier, right? So you treat it as one body instead of as 100 particles. Yeah. So in this glass of water, or in my lungs full of air, there are, what, more than 10 to the 20 particles, and that’s basically infinity. Yeah. So you can imagine that correction, like 5 plus 1/x: well, if x is 10 to the 20, then the five is going to be a pretty good approximation, right? So it’s actually really easy to derive the ideal gas law, PV equals nRT, once you realize, oh, these are interesting quantities when I have lots of particles, like pressure, volume, and temperature, and oh, there’s a relationship between these things that becomes exact in certain limits. So neural networks also admit descriptions like this. The simplest one is what’s called the gradient flow limit. This is so simple that it’s often not even discussed in the same breath as these other limits, but it is totally foundational. The gradient flow limit is saying, okay, let me take my step size to zero. You know that in gradient descent your parameter update is equal to the gradient times some learning rate parameter, usually denoted eta. Well, gradient flow says take eta to zero. Then you might say, well, I’m not going to go anywhere then. And I say, okay, well, I’m going to take the number of time steps to be commensurately larger.
Something that was previously discrete, you know, 100 particles or molecules of water or whatever, ends up being a continuous system that I can treat with differential equations. So this story has played out in a dizzying number of limits in deep learning theory. Some of them are practically useful, some of them are merely insightful. Like the earlier muTransfer story about hyperparameters: this is derived in the theory of infinite width. There’s very little you could say about a neural network of width 10, just like there’s very little you can say about a glass of water with 100 molecules in it, because everything’s so complicated and messy and contingent on your initialization and steps and all this stuff. But another answer to the “why now” question is that the systems we’re actually using now are so big they’re basically continuous. They’re approaching continuous. Very interesting. Yeah, at least in certain respects, right? Neural networks are doing so many things that they’re maybe not continuous in every way, but certainly when you have depth 100 and width 10,000, which are pretty typical numbers for a large language model today, it’s not surprising that some of the math that you might do about infinite depth and width, as long as you’re careful to scale things in the right way to preserve the aforementioned non-dimensional quantities, might be insightful. What are some things that we’ve learned so far? What kinds of useful infinities are there? You said neural tangent kernel. I think the muP stuff is infinite width. We have infinite depth. You have an infinite number of steps. Intuitively, how should I think about some of these continuous systems?
Like, for example, for me intuitively, I’m like, okay, what’s the difference between infinite width and infinite depth? As I said before, I think the interesting limits split into two types. There are those that are realistically useful and those that are theoretically insightful and maybe reveal some fundamental learning behaviors, whether or not we’re actually using that limit in practice, right? So some examples of limits that are really inspired by practice are infinite data, infinite context length, infinite width, infinite number of attention heads, infinite depth. The neural tangent kernel limit is one way of taking things to infinite width. It turns out there are two ways of scaling to infinite width. One of them gives a simpler system that does not do what’s called feature learning: hidden representations don’t evolve. But it’s mathematically very beautiful and can be used to answer certain questions in a very compact way. Then there’s the more realistic limit, which is the muP or feature learning or rich limit. It goes by different names. This actually does preserve feature learning, and it’s way harder to study. But this has been pretty clearly converged on now as the right infinite-width limit to study. And it’s sort of a guiding belief in the research community that we’re part of that pretty much whenever you can study something at infinite width, it’s a good idea to do that, because finite width is just adding one-over-width corrections. It’s like the icing on the cake. You want the cake first, and the icing is like, oh, how do we discretize it? If we could do it at 10 trillion layers, okay, maybe we don’t need 10 trillion layers. It’s kind of annoying to put on our GPU, it’s just too big. Can we get away with 100?
Can we get away with a thousand? Usually yes. Yeah. So it’s worth noting that the way that computational physics solves any continuous system is by discretizing it. How do you do finite element analysis of, like, a metal beam that’s flexing? Well, you don’t solve a continuous PDE, because you can’t even represent a real number on your computer. What you do is you discretize it into a mesh, and then the mesh flexes, and you’re solving some linear algebraic equations. How do you solve fluid flow? Well, you discretize the volume, and then you track the fluid flow at all the points in, again, the mesh, right? How do you solve any kind of ODE? Well, the simplest method you learn is Euler’s method, which is, I mean, basically like gradient descent: it’s just discrete steps, right? And the more complicated ways to solve ODEs are also just more complicated discretizations. They use higher derivatives. So here in this paper we give this the name the discretization hypothesis. I think this belief has been bouncing around as a thing that has needed a name for a while. It’s this idea that maybe practical deep learning should be thought of in this way, and maybe this is why scaling things up is only making things better. Because, oh yeah, of course if you simulate your fluid with a finer mesh, you’ll get a more accurate description. So deep learning is a discretization solution. Basically, the discretization hypothesis states that essentially any practical deep learning system is a discretization of some ideal continuous system that would have performed better on multiple axes.
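The link between gradient descent, Euler’s method, and the gradient flow limit discussed above can be made concrete in one dimension. The following is a minimal sketch under simple assumptions: the loss is 0.5 * curvature * x^2, so gradient flow has the closed-form solution x0 * exp(-curvature * t), and gradient descent with step size `lr` is run for `total_time / lr` steps so that the total “time” stays fixed as the step size shrinks. All names and values are illustrative.

```python
import math

def gradient_descent(x0, curvature, lr, total_time):
    """Euler discretization of gradient flow: total_time / lr steps of size lr."""
    steps = round(total_time / lr)
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x           # gradient of 0.5 * curvature * x^2
    return x

def gradient_flow(x0, curvature, total_time):
    """Exact continuous-time solution of dx/dt = -curvature * x."""
    return x0 * math.exp(-curvature * total_time)

exact = gradient_flow(1.0, 1.0, 1.0)
learning_rates = (0.1, 0.01, 0.001)
errors = [abs(gradient_descent(1.0, 1.0, lr, 1.0) - exact) for lr in learning_rates]
for lr, err in zip(learning_rates, errors):
    print(f"lr={lr}: distance from the gradient flow solution = {err:.6f}")
```

Each tenfold reduction in step size, paired with ten times as many steps, brings the discrete trajectory roughly ten times closer to the continuous one, which is the finer-mesh picture in its simplest form.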
So you have finite data, finite step size, finite width and depth, and all these other things. That’s a super interesting way of thinking about what a neural network is. Yeah. It makes it so much less magical as well. Zooming out for a second, I think right now when people think about AI, they’re like, “Oh, it’s this black box. We don’t understand it.” Blah blah blah. Yeah. We also don’t understand exactly how the water moves around in the glass, right? But we know it’s not going to jump out, right? We just know that’s not happening, because this is how it mostly works. And no matter how fine we make our simulation, we’re never going to find a way to make it jump out. This is just the nature of the system and just how it works. I think it’s really interesting to see deep learning in this continuous way. And the thing about the discretization idea is, look, there are probably better ways we can do these continuous flows, better ways to set these networks up, for sure. But at the end of the day, scaling up isn’t doing something different, right? The scale is giving us better and better approximations, as opposed to making things that are qualitatively different in this sense, right? Yeah. Totally. And you could imagine also, because real data is so rich, that new abilities become resolvable once you’ve discretized your mesh of all human text at a finer resolution, right? Because when you’re doing fluid dynamics, if you have just two particles, you’re just not going to get a very... Waterfall. Yeah. Like, you want to see turbulence with a handful of particles? You won’t see waves. Yeah. So I think that’s really interesting.
It’s really interesting. In terms of open work in the limit bubble, how do you guys think about that? Yeah. So another one of our open questions is about, well, actually I guess we have two about this. So there was the discretization hypothesis. Yeah. And I guess I was thinking maybe the other one is hyperparameters. I think zero hyperparameters. So yeah, I’ll mention two. One of them is: is this discretization hypothesis true? I published a paper two years ago called “More Is Better” that shows that essentially the discretization hypothesis is true for a class of models called random feature models. So, I mean, it’s fairly general, but these models don’t learn features. It’s just a random feature projection that’s static, and then a linear model is trained on top of that. You take learning rate to zero, that’s one way of becoming continuous. You take width to infinity. You take depth to infinity. Well, in transformers there’s more than just a width, there’s also the number of heads. So the idea that the number of heads, width, and depth are maybe just practical necessities to have, and actually just kind of obscure our view of the real, deep, true, simpler thing it’s doing, I think is an important direction. Every time you clear away one of these things, you leave a simpler picture behind. So after the limit of infinite width, people now generally understand, okay, if you’re studying something, width is a possible confounder. We know how to deal with it. Here are some experiments you can do to make sure you’re at large enough width that it’s not a confounder. And it lets you see the rest of the system more clearly because you’ve kind of factored that out. So there’s another open direction we have.
Is there a model of deep learning where you’ve taken every possible limit and you’re left with something that has no hyperparameters and is just, I don’t know, some, like, egg, some platonic ideal, some alien creature that has a minimal number of degrees of freedom? Oh, it would be just like the ideal gas law, right? In the sense that you would still have some quantities, right, about data, about other things, but then you wouldn’t have to do all this weird other machinery. Super complex. Yeah, there’s this idea that really the complexity is in the data, not the model. The model can be kind of dumb. It can be a comparatively small number of lines of Python to describe the architecture of a transformer, right? But the data set is really where the gains come from. So as a theorist, this suggests that we should go about thinking about things like, oh, I should make my model as simple as possible, subject to it showing a handful of behaviors that I know are important, like feature learning. And then I should ask, how does this work on arbitrary data, right? So hitting hyperparameter zero would then let you ask, okay, how does this ideal ball of clay accept imprints from the data that I put it into contact with? Cool. Yeah. That’s a totally different system to study, which is really exciting. Yeah. I think that’s a good description of most of section 2.2, really, of how we can use these limits. I guess moving on to some of the later sections, we already touched on 2.4, how the hyperparameters can be disentangled, a little bit before.
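The idea from the limits discussion above, that a finite-width model is a noisy approximation of a cleaner infinite-width object, can be sketched with the random feature models mentioned earlier. This is a minimal illustration and not the setup of any particular paper: it assumes static random ReLU features with standard Gaussian weights, unit-norm inputs, and compares the finite-width kernel against the known closed-form expectation of relu(w·x) * relu(w·y) for Gaussian w. The function names, widths, and data are illustrative choices.

```python
import numpy as np

def relu_kernel_limit(X):
    """Infinite-width kernel E[relu(w.x) relu(w.y)] for w ~ N(0, I), in closed form."""
    norms = np.linalg.norm(X, axis=1)
    scale = np.outer(norms, norms)
    cos = np.clip((X @ X.T) / scale, -1.0, 1.0)   # clip guards against rounding past 1
    theta = np.arccos(cos)
    return scale * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)

def relu_kernel_finite(X, width, rng):
    """Kernel of a static random ReLU feature map with `width` features."""
    W = rng.standard_normal((X.shape[1], width))
    features = np.maximum(X @ W, 0.0)
    return features @ features.T / width

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm inputs (assumption)
K_inf = relu_kernel_limit(X)

widths = (100, 10_000, 100_000)
errors = {w: np.abs(relu_kernel_finite(X, w, rng) - K_inf).max() for w in widths}
for w in widths:
    print(f"width={w}: max deviation from the infinite-width kernel = {errors[w]:.4f}")
```

The deviation shrinks as the width grows, so in this toy case width acts like a mesh resolution: the wide model and the limit describe the same object, and the narrow model is a coarser version of it.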
So talking about maybe 2.5, the universal phenomena, and moving into the actual applications, what exactly did you mean by “universal phenomena can appear across settings and tests,” and what kind of implications does that have? Well, there’s one experiment that I think is pretty interesting and insightful on this idea of universality. Specifically, I’m thinking of a paper by one of the authors on our paper, Florentine. Take two diffusion networks. With these diffusion models, you give them random noise and out comes an image. If you take two different networks trained on different data sets, and you give them the same random patch of noise, at some point as the models get bigger, you find that they’ll end up producing the exact same image. So there’s this idea that as models are scaling in both the amount of data they’re trained on and the size of the models, there’s some aspect that’s universally shared among all of these different models, that they’re converging to similar solutions. If you imagine that this wasn’t true, it would be very difficult to build a scientific theory, if every time we had a different model there was a different theory that would need to be built. So the idea that there is some level of universality between large architectures is very promising for this idea that we could build a scientific theory of deep learning. Now what we really want to understand is, what are these shared properties? What is universal among them? That’s super interesting. What are some other examples of universality and how it shows up? Right. So I think a good example of this is that maybe a couple of years ago there was this debate: are large language models stochastic parrots, or are they learning world models?
Are they basically just learning to repeat the next word, or predict the next token, just based on correlation patterns within their corpus of data? Or are they actually understanding something about the world around us and why the next token would be the next token? And I think now the general consensus among the community is that they are actually learning deep understandings of the world. And so that would mean that different large models are learning kind of similar world models. So the idea is that maybe there is a universal understanding of the world that all these different models are converging towards, which could explain why, in the example I gave earlier, these two different diffusion networks given the same random patch of input generated the same image: it’s because they’ve learned the same world model of realistic images. And so this is one example of what might be underlying universality, that there is a universal world model that all neural networks are converging towards. That’s so interesting, that there might be a universal world model that is predictive of the data in the world, and that despite data being shown in different orders, and different types of data, and models having different architectures, in order to predict things there’s some kind of convergence. It comes with some pretty big asterisks. This is on a given data set, presumably with a similar architecture. With totally separate data sets, you couldn’t possibly get the same thing unless they were from roughly the same source. Like, if you have only red images and only blue images, it can’t give you the same thing, right? There has to be some sort of similar distribution they’re drawing from or whatever, right? Yes and no. There’s a paper that was articulating this idea.
They call this idea the platonic representation hypothesis: the idea that there’s sort of one true universal world model. Any data set you might take is like a projection of that world model, like the shadows in Plato’s cave. Different data sets will capture different facets of it, but as long as they’re rich enough to capture the whole panoply of things that the marionettes can do, then they’ll learn similar representations. They show some debatable but pretty striking experiments showing that you actually get similar representations even between vision and text models. So the data sets have no overlap. I mean, presumably they were trained on images and their captions and so on, right? So it’s not logically impossible that they learn...
How does similarity between representations get measured?
This is a terrific question [laughter] and such a thorny one, right? You can start to think about it mathematically pretty easily. Let’s take two big models, and let’s say they’re both language models. I feed the same document, the same prefix, into both of the models and propagate forwards to some layer, and I want to know: are these two models thinking the same way at these two layers? The risk here is that in high dimensions, things that are actually quite dissimilar might look very similar.
And so the real risk here is that we might be fooling ourselves that these two things are similar, when actually they’re high-dimensional and dissimilar. This is a huge challenge in deepening our understanding of the universality of models, and it’s something people have been asking about in neuroscience for a long time. How do you compare the neural recordings from two different organisms? How do you compare the neural recordings from an organism and an artificial neural network? So how to compare high-dimensional objects, and ask how similar they actually are, is a very difficult question. But a lot of progress has been made in that direction, and we expect more will be.
I think the method question here is actually more important than the answer. I mean, the answer is important, sure, but it’s a yes or no to this question.
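As a rough illustration of the measurement problem described here, the following is a minimal sketch of one commonly used similarity score, linear centered kernel alignment (CKA). The speakers do not name a specific metric, so the choice of CKA, the synthetic "activations", and every function name below are assumptions made for illustration only. The sketch feeds the same 200 hypothetical inputs to three stand-in "models" and compares their activation matrices.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so each feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a (n x p), b (n x q)."""
    p, q = len(a[0]), len(b[0])
    total = 0.0
    for i in range(p):
        for j in range(q):
            entry = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += entry * entry
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with one row per input."""
    x, y = center(x), center(y)
    return cross_gram_sq_norm(x, y) / math.sqrt(
        cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y)
    )


random.seed(0)
n = 200
# Hypothetical "model A" activations: 200 inputs, 3 features each.
acts_a = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
# "Model B" holds the same information in a rotated basis (features 0 and 1 rotated).
c, s = math.cos(0.7), math.sin(0.7)
acts_b = [[c * r[0] - s * r[1], s * r[0] + c * r[1], r[2]] for r in acts_a]
# "Model C" is unrelated noise of the same shape.
acts_c = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]

cka_rotated = linear_cka(acts_a, acts_b)    # close to 1: same content, different basis
cka_unrelated = linear_cka(acts_a, acts_c)  # small: no shared structure
print(cka_rotated, cka_unrelated)
```

The score is built to ignore rotations of the feature basis, which is why the rotated copy scores as essentially identical. The caution raised in the conversation still applies: when the number of inputs is small relative to the number of features, scores like this tend to be inflated even for unrelated activations, so a high value is not by itself evidence of shared structure.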
I think there’s some version of asking the question where the answer will be yes, and some version where the answer will be no. The challenge, really, the open direction, isn’t “do they?” It’s “what exactly do you mean, what is similar, in what way?” It’s the only one of our 10 open directions in this perspective paper that’s really about the methods, about the metric. My sense from kernel theory is that somehow this needs to capture what functions are easily learnable by a simple model on top of these representations, for example a linear model. There’s some math from the theory of linear regression that could be useful in doing this, but actually getting everything to work with large models is really a challenge. This is one of those deep and tantalizing empirical questions that people have been wondering about for a long time, and it’s starting to seem like the answer has got to be yes. In some sense it’s got to be yes. But we don’t know how.
What is a representation? What does it represent exactly?
Yeah. Go ahead.
I was just going to say, one of the things that’s exciting to me about this theory work more broadly goes back to something you said earlier: as we take these limits and simplify things, we start to have tools, we start to clear things away, and it becomes easier to build up, create larger systems, ask better questions, and so on. I think this representation one is a perfect example of the kinds of things that could have a really big impact once understood.
If you really know what a representation is, if you can really answer this question of what similarity means, it might suggest to you immediately, “Oh, well, then we should do it in this totally other way,” which can be a complete reframe and much simpler and easier. There are things that could happen as we learn how this stuff actually works that help us say, “Oh, we don’t need any of this cruft. You could just do it like this and it would be so much faster and so much easier.” We don’t know if that will happen. It’s possible that we’ll do all this theory work and say, “Yep, we can make it 10% more efficient. That’s the cap. Okay, neat.” It’s also possible that we learn something where it’s like, “Oh no, we were just getting started and doing this totally wrong,” and there could be a huge shift in how we think about these things. So for me that’s one of the really exciting things about this work: it potentially gives us new things that we can build with our understanding, and new questions that we can build on top of, that can really shift what we’re able to do and what kinds of questions we’re able to even ask.
Yeah, this leads into something that I think about a lot, which is the idea that science is an edifice that builds on itself brick by brick. We’re not going to solve the puzzle with one brick. What each paper, each contribution, each project should try to do is lay down one humble but very solid thing that can support the weight of the building we’re constructing. And, to switch metaphors, a rising tide lifts boats and allows you to reach higher bricks. [laughter] All bricks. Yeah. And you can reach higher fruit from standing on your boat, you know. [laughter] Dan makes fun of me for my metaphors.
Speaking of similarities, [laughter] Jamie is kind of a metaphor creator.
It’s true.
Daniel, do you have a preferred way of thinking about representation similarity?
We think that neural networks take their data and progressively build up representations of that data. Sometimes we call those features, sometimes representations. The idea is that through that process, going layer by layer, they’re eventually able to take that final representation and, say, predict the next token. Let’s take the case of next-token prediction: they’re building up, layer by layer, a belief about what the most likely next token might be. So when I think about comparing representations, I’m actually more interested in what precisely about the data, what features of the data, they are extracting. I think this question about comparing models also has a lot to do with understanding the data itself. One of our open directions is how we should even model data: what is a good model of data? And I think that answering the question of how to compare models will inevitably require an understanding of how to think about the features in data.
Interesting. Yeah, all of these answers are kind of like: if you answer that question, it helps you answer this other question, it helps you build up these things and assemble these bricks. One thing that I’m excited for is all the different implications of this work.
Do you want to talk a little bit about some implications that you see? Let’s say we make progress, we make some more of these bricks: how is this helpful to the field more generally, or to other fields? Or even today, are there practical things that you recommend practitioners do? Either version of the question.
When we opened this conversation, we talked about learning mechanics being a kind of physics underlying deep learning, in the same sense that mechanistic interpretability is a biology of deep learning. So, talking about how these understandings can influence other fields: I think there’s going to be a symbiotic relationship between research from the mechanistic interpretability community and the deep learning theory, or learning mechanics, community. I think that goes both ways. I think learning mechanics, the ideas in this paper, and the ideas that I expect will come out of its open directions will have a big impact on mechanistic interpretability by giving formal definitions of what features, representations, and circuits are. These are words, features in particular, that are used in both communities and can mean very different things. So coming to a consensus on what they are, and defining them formally from first principles, from both the bottom-up and the top-down approach, will I think have a huge effect. That’s a direction for learning mechanics influencing mechanistic interpretability. And vice versa: the community around mechanistic interpretability has always brought data to the forefront of every problem they study. I think that is something the theory community in deep learning hasn’t always done. Sometimes we just treat data as (x, y) pairs, input and output.
We focus on the optimizer, the architecture, the nonlinearities. We gravitate towards problems where we want big hammers, or interesting hammers: stochastic differential equations, neural tangent kernels. We want to use these ideas, and sometimes the data comes last. But the data is probably the most important piece of this puzzle, and mechanistic interpretability has found all these interesting insights by looking at the interaction of data, architectures, and neural networks. So I think the theorists who take those insights seriously and really think of them as goalposts to understand are going to make a big impact in our field. I think there’s going to be an amazing symbiotic relationship. We already see that happening. And I hope that’s a direction where both our communities can talk to each other; it’s very generative for problems to work on.
Yeah. For you guys, what are your respective current obsessions or interests? What open question are you very interested in solving right now?
The two open directions I’m most interested in solving right now are open directions one and two from our paper: simple models of nonlinear feature learning, and theory that’s data-aware.
And if you were to describe that as an “I am so puzzled by X”? What is the open question? What are you actually puzzled by?
Okay, so here’s the deal, Kenji. [laughter] We have these beautiful, insightful, solvable models. There are two main workhorse models. Daniel touched on deep linear networks earlier, which have these stepwise learning dynamics. They learn single directions one by one.
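The stepwise behavior mentioned here can be seen in a toy setting. The following is a minimal sketch, not the model from the paper: it assumes a two-layer linear network whose modes have already decoupled, so each direction i is fit by a product of two scalar weights a_i * b_i with target strength s_i. The strengths, the initialization scale, the learning rate, and the function name are all hypothetical values chosen for illustration.

```python
def train_decoupled_modes(strengths, init=1e-3, lr=0.01, steps=2000):
    """Gradient descent on loss_i = 0.5 * (s_i - a_i * b_i)**2 for each mode i.

    Returns the first step at which each product a_i * b_i exceeds s_i / 2,
    together with the final learned products.
    """
    a = [init] * len(strengths)
    b = [init] * len(strengths)
    half_step = [None] * len(strengths)
    for step in range(steps):
        for i, s in enumerate(strengths):
            err = s - a[i] * b[i]
            # Simultaneous update of both layers' weights for this mode.
            a[i], b[i] = a[i] + lr * err * b[i], b[i] + lr * err * a[i]
            if half_step[i] is None and a[i] * b[i] > s / 2:
                half_step[i] = step
    return half_step, [a[i] * b[i] for i in range(len(strengths))]


strengths = [3.0, 1.0]  # hypothetical strengths of two directions in the data
half_step, learned = train_decoupled_modes(strengths)
print(half_step, learned)
```

With a small initialization, each product sits near zero for a while and then rises quickly to its target, and the stronger direction makes that jump earlier than the weaker one. That ordering is the "one direction at a time" picture in miniature.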
There’s another class of models called kernel methods, kernel regression, that is connected to deep learning not by killing the activation functions but by taking a certain infinite-width limit. Not the mu limit, the simplifying one. Then you get kernel regression out, and this you can also solve. It has a beautiful theory of learning, and it can actually learn functions that are fully nonlinear in the data, except the learning dynamics of the network are linear. So it’s like a linear function learned after a nonlinear projection. This also shows a simplicity bias, it’s extremely beautiful, and we have a complete theory of generalization that works really well. It just doesn’t apply to anything that... well, yeah. So here too we have another solvable case that comes from a simplifying assumption on a deep neural network, but again we’ve thrown away an important form of nonlinearity, in this case the nonlinearity in the parameters. I want to know what’s at the intersection of these two things. I want to write down a model that is nearly as tractable and insightful as these two but that gets the best of both worlds. I want a model whose dynamics I can solve and study and understand better than even a shallow MLP. That’s a multi-layer perceptron, the simplest neural network, and even there the pointwise nonlinearity is too complicated, too annoying in my view, to do end-to-end complete trajectory studies. So I want something that’s simpler than that but can still learn fully nonlinear functions and still has fully nonlinear dynamics. My team and I have identified a function class of this sort, and we think we know how it’s related to MLPs.
We can solve its dynamics in a large number of cases. We can predict things about how MLPs learn from this, and it seems to work pretty well in a large number of cases. I’d say right now I’m pretty obsessed with getting that to work out, because having a good nonlinear model of the dynamics of feature learning would allow you to immediately ask so many new questions in a setting where the complexity is so pared down, but you know it’s still capturing something important about the system you’re studying. So that feels like a huge unlock to me.
Interesting. How about you, Daniel? What’s your “I’m so confused by X”?
Sure. So this is maybe not something I’m so confused by, but the arc that I see for my research in the next year or so. We’ve talked about limits, and something that Jamie and I worked on together about a year or two ago is a limit I was really interested in: the limit as we take all our parameters to the origin. If you take a neural network and set all your weights and parameters to zero, the input-output map of that network would be zero, and it would never train, because the gradient of one parameter always depends on another. So this is essentially a critical point of the loss landscape, something we know. Ideally, if we perturbed slightly off of that critical point, with enough time the model would learn something. What has been shown before in other settings is that the learning dynamics of networks in this vanishing-initialization limit start at a saddle point at the origin and progress through a sequence of saddles: saddle-to-saddle dynamics.
They jump from one saddle point to another to another, eventually going down towards a global minimum. In some simple models, you’ll see this in the loss curve as plateaus followed by drops, plateaus followed by drops. The idea is that each one of these jumps from one saddle point to another is learning some aspect of the task. So Jamie and I worked on a paper where we tried to unify a bunch of existing saddle-to-saddle dynamics under one picture. We’re training this model through gradient descent, but maybe there’s an alternative optimization process, a discrete optimization process, that would describe it, where each step of the discrete process is learning a feature. So if we understand that process, we’re also understanding some aspect of the features that the network is going to pick up. What we did in that paper was look at existing works and try to unify them. But we also ended up looking at some of the work from the mechanistic interpretability community on modular addition and how neural networks learn these Fourier features, and we proved, using our ansatz, that these features would emerge in that setting. That put me on this trajectory, for the next year up to now, of looking at all these different algebraic tasks. Modular addition can be thought of as a task of group composition under the cyclic group. So I got interested in these tasks where we know a lot about the underlying structure of the task and we know what the right features are, and in trying to understand how, through training, a neural network picks up or acquires those features.
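One way to see why Fourier features are natural for modular addition is that they turn addition modulo p into ordinary multiplication. The following is a minimal check of that standard fact about the cyclic group, not code from the paper; the modulus p = 7 and the function name are assumptions chosen for illustration.

```python
import cmath

p = 7  # hypothetical modulus


def character(k, a):
    """The k-th Fourier feature of a residue a modulo p."""
    return cmath.exp(2j * cmath.pi * k * a / p)


# Largest violation of chi_k((a + b) mod p) == chi_k(a) * chi_k(b) over all k, a, b.
max_error = max(
    abs(character(k, (a + b) % p) - character(k, a) * character(k, b))
    for k in range(p)
    for a in range(p)
    for b in range(p)
)
print(max_error)  # zero up to floating-point rounding
```

Because each feature of the sum is just the product of the features of the two inputs, a learner that represents its inputs this way can carry out the group composition with a simple multiplicative interaction.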
So the hope here is that by really deeply understanding that process... I mean, you could ask, “Well, why is this interesting? You already know the solution to these tasks. You already know the right features.” The hope is that by understanding how neural networks acquire those features, that acquisition process, whatever that method looks like, is going to transfer to settings where we don’t understand what features neural networks are learning. For example, let’s take a next-token prediction task. I might think about generating data from some hidden Markov model, or some sequence-generating process where I know the underlying probability distributions. I know what an optimal learner would do if it was tasked with next-token prediction, and I would know the right features for that task. I train a neural network on this synthetic data that I’ve generated, and I try to understand exactly what sequence of steps it’s taking to predict the next token optimally. Eventually, let’s say, it gets there, but how did it get there? If we understand that process, that might give us a lens for when we’re training a large language model on next-token prediction on the internet, where we don’t know what the Bayes-optimal thing to do is: we could apply that process and try to understand, well, maybe these are the features it’s acquiring, and this is how it’s then making that prediction.
So we should be able to, in theory, predict what features it’s acquiring?
Yeah.
In what order, roughly?
And it might not be predicting a priori. It might just be an alternative algorithm that, if you run it on the data, would give you the same solution at the end, but this alternative algorithm would be more interpretable.
Interesting. Yeah.
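To make the synthetic setup concrete, here is a minimal sketch of what "knowing the optimal thing to do" looks like for a hidden Markov model: the Bayes-optimal next-token prediction can be computed exactly with the standard forward recursion. The two-state model, its probability tables, and the function name are hypothetical values chosen for illustration and do not come from the speakers’ experiments.

```python
# Hypothetical 2-state, 2-token hidden Markov model.
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]  # TRANSITION[i][j] = P(next state j | state i)
EMISSION = [[0.8, 0.2], [0.3, 0.7]]    # EMISSION[i][t] = P(token t | state i)
INITIAL = [0.5, 0.5]                   # distribution of the first hidden state


def next_token_distribution(tokens):
    """Bayes-optimal P(next token | tokens so far), via the forward recursion."""
    # Belief over the hidden state that will emit the next token.
    belief = INITIAL[:]
    for tok in tokens:
        # Condition on the observed token, then renormalize.
        belief = [belief[i] * EMISSION[i][tok] for i in range(2)]
        total = sum(belief)
        belief = [w / total for w in belief]
        # Propagate the belief one step forward in time.
        belief = [sum(belief[i] * TRANSITION[i][j] for i in range(2)) for j in range(2)]
    return [sum(belief[i] * EMISSION[i][t] for i in range(2)) for t in range(2)]


prior = next_token_distribution([])
after_zeros = next_token_distribution([0, 0, 0])
print(prior, after_zeros)
```

A network trained on sequences sampled from such a model can then be compared against this exact predictor at every point in training, which is what makes the question "how did it get there?" answerable in the synthetic case.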
So I think this idea of using synthetic data sets that have a lot of structure and still require nonlinear learning, and studying how neural networks learn the structure in those synthetic data sets, is a really promising direction.
Mhm. Yeah, I think that’s a really good example, maybe to zoom out a little bit and ask about the broader implications of this work as well. There we’re talking about modular arithmetic. Okay, sure. But you can imagine bootstrapping from there to, say, by-hand arithmetic, where you’re like, “this plus this is that, and then carry the one,” and so on, which is like a language, something you could see if someone was explaining arithmetic to someone else. You could imagine making it richer and richer to the point where it’s like, “Oh, this text looks kind of reasonable.” I can understand what it’s trying to say. It makes sense to people. It’s how you would teach someone arithmetic or multiplication or some other concept. And we have a full model of what all the features are that the neural network is learning, how it’s actually making these decisions, how it changes with scale, and so on. We’re clearly not there today. But this is the kind of thing that seems really exciting to me, because once we do have that level of understanding, we might be able to talk about this stuff sensibly from a policy perspective or an engineering perspective, at a little bit higher level where it connects to real life. Until we have the tools and fundamentals for this, it’s kind of hard to say anything at that level, right?
Yeah. I’m never quite sure how much to make this a central motivation for what I do. I feel quite confident that some form of regulation, of policy for AI, will be necessary.
I think we haven’t really tried many ways of regulating AI yet, I would say, [laughter] and also things are changing very fast. But one possible way things play out is that having a language to talk about these things in terms other than their raw statistics, number of tokens trained on, FLOPs for a forward pass, gives us tools to better describe, regulate, and characterize these systems, and to have a sober public conversation about them.
Yeah. Being able to attribute data to the learned model, being able to understand the influence of one set of data points or a certain corpus of text on the final trained model, or on the learning process, would be a pretty important tool in this conversation about regulation, copyright infringement, and other things like this. So really understanding the influence of data is critical not only for a deep scientific understanding but also for the practical and regulatory framework that we would eventually need.
There’s also a safety angle, right? I think we’ve used engineering examples before, like bridges. It’s hard to make very many claims about safety that feel super grounded without really understanding the system that you’re talking about. If you just have a fully black-box understanding of bridges, it’s like, “Well, it didn’t fall down yet.” You can’t really get much further than that, but we can do so much better once we really understand bridges or engines or planes or nuclear reactors, right?
Yeah. So we say in the paper that we believe there are three types of reasons to want a theory of learning mechanics. One is fundamental scientific reasons: understand intelligence, understand deep learning, big scientific mysteries here. One is practical reasons.
There are many things that we would like to do with deep learning that we could do much better if we had an explanatory theory. The last is safety reasons. That’s the one setting where we would most like to have guarantees and understanding, right? It’s not super critical to have very reliable chatbots most of the time. But certainly if we’re in a scenario where very powerful AI systems are making high-stakes decisions, maybe in real time...
Yeah. For example, running your life.
...then we want to be able to appeal to some kind of grounded way of thinking about these things. This is also a response to the common objection that, oh well, these things will understand themselves before you ever understand them, so why are you trying to do this? I mean, this is a concern for human intellectual endeavors across the board right now, so far from unique to us. But I think there’s a unique answer. Set aside the fact that we already understand some things and they’re being useful, that isolated pockets of understanding are useful even without the whole thing, and that it would be nice to use these systems to understand these dynamics in order to design better versions of these systems; we don’t only want to evolve systems that we rely on. Even setting aside those answers to the objection that you won’t get there before they do, safety and oversight is really the one setting where, unless you trust the AIs to police themselves, you don’t want to totally hand over control, and having some kind of fundamental theory gives us a foot in the door.
Yeah. It seems hard to bootstrap into a place where you do have control and oversight and safety and understanding without this, basically, right?
On that note, for everyone who is listening, and for those who are grad students: is there anything you’d recommend they do if they want to participate in the field, work on one of these open questions, or get involved in the community somehow? Reach out to you, publish a cool blog post, send you some money? What should they do?
So I think you could see this paper from two different angles. From one angle, we basically have three claims in this paper: there will be a scientific theory, pieces of this theory are starting to emerge, and this theory will take the form of a mechanics of the learning process. So you could think of this paper as essentially our justification of those claims. Or you could look at this paper from a pedagogical point of view. Reading this paper and looking at the papers that we cite and the stories that we highlight, it’s basically 14 authors in this space describing what we would say would be a great intro course to deep learning theory. This is like the textbook for understanding deep learning theory right now, where we haven’t written a textbook yet because it’s so new.
We don’t know the answer. We don’t have the answers.
Yeah. It’s the syllabus for an intro course.
Exactly. So I would say: read these different things and, if you’re a young researcher, think about which of these different approaches or methods of handling complexity appeal to you and your thought process, and decide to go deeper into that. Reach out to the authors who are aligned with those approaches, and think about the open questions and open directions we posited. And I’m sure they all have their own as well.
Yeah. And the last section of the paper is called how to get involved in the development of learning mechanics.
Here we give a list of tenets of advice for young people who want to get involved in this. They don’t have to be young, just newcomers to the field. Things like: try a few problems before going into one; value scientific insight and understanding over the difficulty of the theorem you proved; do lots of experiments, because they’re cheap and easy. So I’d recommend that a young grad student look at both this section of the paper and the open directions section, and then read the rest with an eye for trying to hear and disentangle the stories that we’re telling from this broader mass of literature, and seeing what reaches out to them and compels them. We’re also putting together a website. At time of recording the URL is learning mechanics.pub. It’s linked to in this perspective paper. We’re hoping that this will serve as a place for high-quality perspectives from experts, pedagogical materials, open questions, and maybe even a forum for discussion.
Thank you both so much. This was really fun. I hope that many more people who are wonderfully curious and qualified will join the field, and we’ll actually develop a theory of deep learning.
Thanks much.
All right. Thanks, guys.
Thanks for having us.