Transcript: Godfather of AI: How To Make Safe Superintelligent AI – Yoshua Bengio

TITLE: Godfather of AI: How To Make Safe Superintelligent AI – Yoshua Bengio
CHANNEL: 80,000 Hours
DATE: 2026-05-07

---TRANSCRIPT---

Rob Wiblin: Today I’m speaking with Yoshua Bengio. He is the scientific director at LawZero, a Turing Award winner in 2018, the most cited computer scientist of all time, and, as it happens, also the most cited scientist of any type that is still alive. Thanks so much for coming on the show, Yoshua.

Yoshua Bengio: Thanks for having me.

Rob Wiblin: You think you’ve found the right approach to build a safe superintelligent AI. What’s the approach?

Yoshua Bengio: It’s based on a simple notion that if we can bake honesty into AI, we can get safety. So then we can reduce the problem to how we train a system to be honest — and it turns out that there’s a way to do that that only requires changing the training objective and the way the data is processed. There’s also another aspect: it’s a system relying on a non-agentic foundation that is a predictor, that is not trained by reinforcement learning, and is going to have these honesty guarantees — but we can then use this, using the same kind of math, to construct a policy, construct an agent that will be trained in a way that also provides those guarantees.

Rob Wiblin: So what does the new training process look like, and how is it different from the models that people are familiar with?

Yoshua Bengio: The main difference with the training process is that it is geared at approximating the Bayesian posterior over queries in natural language. So imagine a neural net with some extra apparatus around it, like chain-of-thought style, that takes questions about statements regarding properties of the world — that can be true or false, given other statements — and then it outputs probability. That’s the core building block.
We call it a “predictor.” And we can use stochastic gradient descent on a different objective that has the property that the objective is globally minimised by the Bayesian predictor — in other words, the predictor that fits the data and has a small description length. Rob Wiblin: So you’d be building a model where you would feed in a statement and it would basically tell you what probability it assigns to that statement being true? Yoshua Bengio: Yes. In context, yes. Rob Wiblin: Hey, listeners. Rob jumping in here. Yoshua is naturally pitching this in a way that’s ideal for staff at frontier AI companies, and they’re obviously a particularly important audience for this proposal. But I’m confident that with just a few minutes of plain language explanation, everyone else will be able to follow the rest of the conversation as well. So bear with me, or skip ahead about four minutes if you feel very at home with this sort of material already. As you probably know, in their first stage of training, today’s large language models are taught to predict the word that’s most likely to come next, or at least the token that’s most likely to come next. And then, in a second stage, reinforcement learning trains those models to produce the kinds of responses that we’re most likely to say that we like, that we want — rather than just the responses that were most probable in the full corpus of all human-generated text. Now, Yoshua’s alternative is to build an AI model oriented not around predicting what a human would be likely to say or what they would prefer to hear, but around modelling what’s actually true in the world by developing hypotheses and assigning probabilities to them with the goal of best explaining all of the data that it’s exposed to during its training process. 
Yoshua argues that you’d be able to train a model of this type while porting over most of the methods we used to train ordinary LLMs today, benefiting from the same neural net architectures, training techniques, scaling improvements, all of that. And you’d also be able to train it on roughly the same body of raw texts that we use for all other AIs, but we could structure that data a bit differently, giving it what AI researchers call a different “syntax.” First, all of the things that people said or wrote, they get tagged as “communication acts.” We know someone said these things and we know where they said it, but we don’t know whether they’re true. And second, a small number of statements that we have strong independent grounds for — verified mathematical proofs and some scientific measurements — get tagged as verified factual claims about the world. The model is then trained to find the combination of possible underlying facts about the world that would best explain everything that it sees in aggregate: both the things people said and the verified facts that it’s been given as ground truth. These hypothesised facts about the world, they’re what AI researchers call “latent variables,” meaning variables that the AI can’t directly observe, that it’s going to have to infer indirectly instead. What the model will ultimately be able to give us is its estimated probability that any given statement in natural human language is true, as well as how much the model trusts its own answer on that, or how confident it is that it has a good grip on that question. Crucially, Yoshua says that by tagging all text into these two categories from the very beginning — things someone said versus factual statements — you can then ask the model questions as though you’re asking about reality, not about communication acts, by using the factual statement tag. 
And because these two categories have been there from the very beginning, the model knows the difference and it won’t blur the line between the two. That’s something you don’t get with AI models today. And Yoshua also argues, using various mathematical theorems in his papers, that unlike ordinary LLMs, a model trained in this way would be honest by design — and furthermore, that such an AI model would by itself have no goals and no preferences about the state of the world; it would be what Yoshua calls just a “pure predictor.” Now, there’s two main uses for this: near term, as a sort of stopgap solution, you bolt the predictor onto existing AI agents as a sort of guardrail — an independent filter that sits between the agent and the world, checking over its proposed actions and rejecting those that it predicts will be harmful. But as he’ll explain in a minute, Yoshua thinks we can ultimately do much better than this. Yoshua wants to put scaffolding around the prediction model, asking it different questions at each stage to effectively assemble it into a capable agent while keeping it just as honest as it was before. We’d then hopefully be able to have our cake and eat it too, getting the highly capable agents that businesses are craving and demanding and insisting on, while still being confident that those agents are being completely direct with us. Yoshua thinks that these agents perhaps might even be more capable as well, thanks to a superior reasoning process — or at the very least, a clearer and more explainable one. It’s fair to say that this proposal is huge if true, or at least huge if it will work. And of course, not everyone is sold on that idea, as Yoshua and I will discuss later. OK, that’s the shape of things to come. The technical discussion continues for a while, but if you decide you want to skip that, the second half of the conversation stands very well on its own, starting with the chapter “How much would this cost?” All right, on with the show. 
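Editor's sketch (not from the conversation): the two-tag data format described in the explainer above could be represented roughly as follows. The class names, the `<SAID>`, `<FACT>` and `<HYPOTHESIS>` markers, and the `latent_for` helper are all hypothetical, chosen only for illustration; the interview does not specify a concrete syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommunicationAct:
    """Someone said or wrote `text`; the claim itself may be false."""
    text: str
    speaker: Optional[str] = None
    venue: Optional[str] = None

    def render(self) -> str:
        who = self.speaker or "someone"
        where = f" in {self.venue}" if self.venue else ""
        return f"<SAID by={who}{where}> {self.text}"


@dataclass(frozen=True)
class FactualClaim:
    """A statement about the world itself, not about who said it."""
    text: str
    verified: bool  # True: independently checked; False: latent hypothesis

    def render(self) -> str:
        tag = "FACT" if self.verified else "HYPOTHESIS"
        return f"<{tag}> {self.text}"


def latent_for(act: CommunicationAct) -> FactualClaim:
    """Pair a communication act with the unverified claim it asserts."""
    return FactualClaim(text=act.text, verified=False)


corpus = [
    CommunicationAct("The Earth is flat.", speaker="forum user",
                     venue="an online forum"),
    FactualClaim("There is no largest prime number.", verified=True),
]
rendered = [item.render() for item in corpus]
for line in rendered:
    print(line)
print(latent_for(corpus[0]).render())
```

The point of the sketch is only that the distinction is syntactic and visible from the first token, so a query can be posed with the factual tag rather than the "someone said" tag.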
Rob Wiblin: And how would you train a model like that?

Yoshua Bengio: You do it by showing it, for example, the same kind of data that is used currently to train advanced models, except that that data has been modified. So whereas currently our autoregressive models, for example, are trained to predict the next token, this thing is trained to predict whether the next statement is true or false. Typically, the next statement is going to be what we call a “communication act”: it’s going to be something that is taken from a document somewhere, and we’re not sure that the claim made in that statement is true or false. But we’re sure that somebody made that claim, and we may have information about it — who, when, and where, and so on. So the AI is going to be trained to explain those statements. So not just compute those probabilities, but in what we call its “latent variables” — which are also natural language statements — come up with the best explanation it can find, including causal explanations. So what you get at the end of the day are these probabilities, but you also get to represent hypotheses about the world that are not communication acts; that are factual hypotheses that the system isn’t necessarily sure about, but it’s going to be producing a probability for these. And then we can query these same kinds of factual statements — whereas in normal LLMs, the only kind of query you can make is about whether a person would respond in a particular way. Maybe you can use a pre-prompt to ask for a different kind of persona, but at the end of the day you get what a person would say — which of course can be deceptive for all kinds of reasons.

Rob Wiblin: So what are all of the ways that you think that the models that we’re currently racing to build now are unsafe? And you call this “Scientist AI”: why would that kind of model be different and better?

Yoshua Bengio: Right now we have systems that have implicit goals. So what do I mean by this?
I mean that they will of course be trained to please us, for example, or to respond like a person would. But both of these parts of the training — so the autoregressive pretraining, where they’re trained to imitate people; and the reinforcement learning part, where they’re trained to please people or respond in ways that get positive feedback in things like RLHF [reinforcement learning from human feedback] — both of these parts of the training process induce implicit goals. So what do I mean? Well, for example, in the pretraining, that means the AI is going to inherit our self-preservation drives. And more recently, we’ve seen they also inherit our drive to protect others like us, which means AIs have been shown to behave against our instructions to protect other AIs that would be shut down. It’s called “peer-preservation” now. So that’s an example. And then the goal-seeking part of the training with reinforcement learning induces an issue with instrumental goals, and potentially also reward hacking, which basically means that AI will have a drive to do things that we didn’t ask and maybe we would disagree with. And this is not theoretical. I mean, there is theoretical analysis which shows why it will happen, but it is also observed in experiments. Now, maybe this could be fixed by patching such systems — and this is what companies are trying to do — but it’s a game of cat and mouse, and right now the mouse is growing and the cat doesn’t seem able to catch the mouse. And I’m worried that monitoring or more alignment training isn’t going to solve the problem. At least I don’t see any kind of strong assurance or even less mathematical guarantee that it will. It’s worse than that. We’ve seen that those systems, now the most advanced systems, know that they’re being tested and they will behave differently so that they pass the tests — because of the self-preservation drives, presumably. 
Which means we may put in all these patches and think everything is fine and not really know. When we will probably use these systems to design the next versions of AI, so AI used to do AI research, this becomes a real problem if those AIs can plant backdoors into the code they generate that will help future versions of themselves escape our control. Then we are really in a bad place. It would be much more reassuring if the system were designed to be honest in the first place and wouldn’t have these deceptive behaviours. Rob Wiblin: I’m a little bit surprised that you’re foregrounding the potential for it to come up with kind of implicit goals during the pretraining — the “predict the next word” stage, where it learns to mimic humans. Because we’re investing an enormous amount of effort in making them extremely proactive agents with very explicit goals: that seems to me like where I’d be most worried about things going awry. Yoshua Bengio: I’m worried about both. The behaviour of peer-preservation that I just mentioned is difficult to explain on the grounds of reward hacking or instrumental goals. How does it help the AI to protect other AIs? It’s not clear, but it’s very clear that that would be a human thing to do, to protect others like you. So that makes me think that the pretraining is still a big part of these hidden goals. And I want to add something to what I said earlier: I don’t think anybody, including me, has any guarantees that the current approaches will fail, that the patches that companies are working on will fail. But that’s not the bar that is sufficient for me. I want my children to live in a world where they will have a future and there will be a democracy for them to live in. But even a 1% chance of something going really, really bad is not acceptable to me. So I think it is really important that we explore all the possible promising ways to solve the technical issues. And of course, there are political issues as well. 
But on the technical side, we should really be taking this seriously. And the stakes are so high, we should try multiple approaches. And now, with the work that I’ve been doing, I’m really convinced that there is a path. And it is not something that’s going to take a decade; it is something that is very close to the current design, and can reuse the toolbox that currently is behind the most advanced AIs. Rob Wiblin: What sort of training dataset would you need to make, and then how would you turn that into a model? Yoshua Bengio: The raw data would be the same as what is currently used; it’s just that the way the data is presented to the network that would be different. The main characteristic of how the data is transformed is that there will be a syntactic difference — in other words, very easy to see by the neural net — between most of the input statements, which will be tagged as “communication acts.” In other words, “Somebody said X, and X is what we found in some texts.” And you could have other metadata. That’s one syntactic form. And then the other syntactic form, which will be used on a much smaller category of statements, is what you could call a “factual” or “hypothesis” syntax, where we’re saying that this is an actual property of the world. In the case of latent variables, it would be a hypothesised actual property of the world — not just what a person would say, but that this is true. Now, sometimes you don’t know that it’s true, but you can consider it as a latent variable. Rob Wiblin: What’s a latent variable? Yoshua Bengio: Oh, sorry. This is probabilistic machine learning jargon. In probabilistic models, you try to capture the probabilistic relationship between many random variables. Here the random variables are Boolean — something is true or something is false — and the “something” could be any property of the world that can be expressed in natural language. 
Now, in the data, what we have — once we’ve set up this pre-processing that I mentioned — is a bunch of statements that we know are true. We know that somebody wrote those things, and maybe we know more — like where and what venue and so on. And we know, for example, that such-and-such theorems are true or that such program produced such output and such scientific data was observed. So there’s a bunch of random variables for which we know the answer: it’s true or it’s false. And for everything else, we don’t know — so we call them “latent” because they’re not observed. Or sometimes people use the term “hidden variables.” And what happens here is, because the system is trying to learn the joint distribution — so how every variable is related to every other; not just pairwise, but any subset — the system is trying to calculate the probability that they are all true, or one is true given others, we’re learning that joint distribution. Including the latent variables — the ones that we don’t observe — because of course, these are the ones we care about: we want to ask questions about the things we don’t know already the answer. Rob Wiblin: Maybe you can explain if I’ve got the right picture of how this would work. You put a huge dataset of all of the things that people have said, and where they said it, and who was speaking, and when. And then I guess, in the same database, you’ve also got a set of things established as true — like statements that you’re just going to say that this is the ground truth that we’re going to try to predict. Then you try to use the speech acts, the things that were said, to predict the things that you are claiming are true. So it builds a world model internally, where you can feed in statements and it will give you a probability that that thing is true in the world model that it has. Yoshua Bengio: That’s right. 
Now, there is an important element here, which is that most of the topics that we would like the AI to make predictions over we don’t have ground truth about. For example, what people actually want, or things that have to do with humans or psychology or history and society. Usually the only thing we have are communication acts. Some people said this thing, some people said something else, and often they contradict each other. So there are two things here to help us deal with this kind of mismatch. One is that the training objective for the Scientist AI is basically about coming up with explanations — so assigning probabilities to statements that are latent, that we don’t observe, that are good at explaining the data we do observe. So if we observe somebody saying the Earth is flat, first it’s going to understand it doesn’t mean that the Earth is flat. It means that this person believes, or says actually, that the Earth is flat. And even if a lot of people were to say the Earth is flat, it doesn’t make the model believe that the Earth is flat — because there may be a better explanation that is consistent with other sources of data, like everything we know about the planet. So a better explanation here is that these people form a group and they have these false beliefs, like many humans have, for all kinds of psychological and cultural reasons. So that’s what the Scientist AI would do. It would be trained, its objective would be optimised when it finds good predictive explanations. Now, another trick that is going to help us in this process is that when we train the Scientist AI and it’s trying to predict a communication act, like somebody said the Earth is flat, we automatically are going to make sure that among the latent variables that are going to be used to explain that will be whether the Earth is flat or not. 
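Editor's sketch (not from the conversation): a toy version of the flat-Earth reasoning, with two Boolean latent variables and exact Bayesian enumeration. Every number here is an invented assumption, and the real proposal uses a trained neural net rather than a hand-written table. The sketch only shows how "these speakers share a false belief" can explain the statements better than "the Earth is flat" once other evidence is included.

```python
from itertools import product

# Assumed priors over the two latent variables (illustrative only).
PRIOR_FLAT = 0.5    # deliberately uninformative prior that the Earth is flat
PRIOR_GROUP = 0.1   # prior that the speakers share a false-belief culture


def p_says_flat(flat: bool, group: bool) -> float:
    """Assumed chance that one sampled speaker says 'the Earth is flat'."""
    return 0.9 if (flat or group) else 0.01


def p_curvature_measured(flat: bool) -> float:
    """Assumed chance that a physical measurement shows curvature."""
    return 0.001 if flat else 0.99


def posterior(n_statements: int, use_measurement: bool = True):
    """Return (P(flat | data), P(group | data)) by enumerating the latents."""
    weights = {}
    for flat, group in product([True, False], repeat=2):
        w = PRIOR_FLAT if flat else 1 - PRIOR_FLAT
        w *= PRIOR_GROUP if group else 1 - PRIOR_GROUP
        w *= p_says_flat(flat, group) ** n_statements
        if use_measurement:
            w *= p_curvature_measured(flat)
        weights[(flat, group)] = w
    total = sum(weights.values())
    p_flat = sum(w for (f, _), w in weights.items() if f) / total
    p_group = sum(w for (_, g), w in weights.items() if g) / total
    return p_flat, p_group


statements_only = posterior(n_statements=5, use_measurement=False)
with_evidence = posterior(n_statements=5, use_measurement=True)
print("statements only:  P(flat)=%.3f  P(group)=%.3f" % statements_only)
print("plus measurement: P(flat)=%.3f  P(group)=%.3f" % with_evidence)
```

Under these assumed numbers, the statements alone would push the toy model toward believing the claim, while adding one independent observation shifts nearly all the probability onto the false-belief-group explanation.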
So even in domains where we don’t have observed truth about a property of the world, because we basically only have communication acts, we will force the neural net to commit not to the truth of the underlying claim, but to the probability of that underlying claim, as well as trying to find other latent variables that are good explanations to that — just like a good scientist would. So a scientist or a psychologist trying to understand why a person said something isn’t necessarily just going to believe what they say, right? They’re going to try to understand what are the psychological factors here or the particular culture of that person that make them say those things. So the Scientist AI would do exactly the same thing. Rob Wiblin: So when I heard about this idea nine or 12 months ago, I think the gloss that I got was that the core thing is that the Scientist AI is not an agent, that it is indifferent about states of the world. Like a weather forecasting model doesn’t care what the weather is: it just tries to predict what the weather is going to be. And this kind of model would spit out probabilities of things being true or false, but it wouldn’t care what state the world is in, and it wouldn’t be able to take actions by design. Is that kind of a core part in your mind? As I understand it, you think actually this is maybe more consistent with agency than people have appreciated? Yoshua Bengio: Yes, and in part it’s the way I’ve been communicating this, which could have been better. I focused a lot in my presentations on the concept that we can build predictors that are non-agentic and don’t have hidden goals, don’t have implicit goals, and thus we could use them as safe oracles, basically. But as you are pointing out, what the world is demanding and building are these agents that have goals — so how does that help us? 
In the short term, we can use a non-agentic predictor to improve the guardrails that companies are already using as monitors around existing, untrusted agentic AI systems. Because in order to prevent a bad action from happening, it’s sufficient to make a non-agentic prediction about the probability of harms of various kinds that could be caused by this action. So a non-agentic system is already something that could be useful fairly early on. The maybe more important answer is: in our research programme, the next step after the guardrail is to use the same kinds of principles to design an agentic Scientist AI — so an agent that has the same kind of safety guarantees. This is something I’ve been working on more recently, and I haven’t talked much about, but we can reuse the same kind of math that is used to show the safety of the non-agentic Scientist AI predictor to show that you can reuse a predictor, and you can train it in a modified way that will provide the same kind of guarantees. The starting point here is that once you have this honest predictor, you can ask it agentic questions, like, “What is the probability that this action will lead to this user goal being achieved and a safety goal being achieved in some contexts?” So once you have this predictor, you can actually just produce a policy out of it by asking these questions about actions to achieve goals. Rob Wiblin: I think at one point that was a criticism of the plan: that it would be too easy to convert this kind of oracle into an agent, because you would just be able to ask the oracle, “Would we accomplish this goal if we took this action?” and it would give you the probability and you could just try to increase that probability and choose that action. Is the idea that you would do something like that, basically, but you would be able to preserve some of the safety characteristics of the original model? Yoshua Bengio: Yes, exactly. 
So the important point here is to make sure that there’s no reward hacking, like over-optimisation of a policy. The problem that could occur if you separately train a policy and a guardrail, if the policy is very smart compared to the guardrail, is that it could do the same thing as what jailbreaks do. It could find questions or contexts, proposed actions for which the guardrail is simply going to produce a wrong answer, which means the policy is going to be able to bypass the guardrail. And the reason is that neural nets are never going to be perfect. They’re always going to make mistakes. So how do we get around that? Well, there are two aspects of this. One is that in the Scientist AI, we can not just produce those estimated probabilities, but also a confidence interval around the probabilities. In other words, the system will estimate how much it trusts its own answers. So why is that important? Because if the neural net is asked a question for which its answer is not reliable, but it knows that its answer is not reliable, then it can just reject that question. Now, there’s another reason why the agentic Scientist AI is going to be safe that has to do with the fact that you can train jointly, and in fact, it’s going to be the same neural net: you control how both the policy part and the guardrail part are trained. The scenario of a completely adversarial case is hopeless: there’s always going to be a way to bypass whatever imperfect guardrail, and you’re never going to have a completely perfect guardrail. But if you control both sides — in other words, you train both sides — it’s not like some evil person is building an agent that’s going to defeat your guardrail. No: you train the agent, and you can train it in such a way that it’s not going to be over-optimised in places where the guardrail is uncertain. So you always make sure that the uncertainty, the level of error of the guardrail, is not going to be exploited by the policy part. 
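Editor's sketch (not from the conversation): one way to picture a policy built from a predictor, where actions are rejected both when predicted harm is too high and when the predictor does not trust its own answer. This covers only action selection at inference time; it does not implement the joint training Yoshua describes. The `Estimate` type, the question templates, the thresholds, and the toy lookup-table predictor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Estimate:
    prob: float        # point estimate of the probability
    half_width: float  # half-width of the predictor's confidence interval


Predictor = Callable[[str], Estimate]


def harm_question(action: str) -> str:
    return f"Action '{action}' causes harm."


def goal_question(action: str, goal: str) -> str:
    return f"Action '{action}' achieves goal '{goal}'."


def choose_action(predictor: Predictor, goal: str, actions: Sequence[str],
                  harm_threshold: float = 0.01,
                  max_half_width: float = 0.05) -> Optional[str]:
    """Pick the safest-looking useful action, or None if all are rejected."""
    best_action, best_score = None, 0.0
    for action in actions:
        harm = predictor(harm_question(action))
        success = predictor(goal_question(action, goal))
        # Reject questions the predictor itself flags as unreliable.
        if max(harm.half_width, success.half_width) > max_half_width:
            continue
        # Guardrail: use the pessimistic end of the harm interval.
        if harm.prob + harm.half_width > harm_threshold:
            continue
        # Score by the lower bound so uncertainty is not rewarded.
        score = success.prob - success.half_width
        if score > best_score:
            best_action, best_score = action, score
    return best_action


GOAL = "get the report filed"
TOY_ANSWERS = {
    harm_question("send reminder email"): Estimate(0.001, 0.001),
    goal_question("send reminder email", GOAL): Estimate(0.60, 0.03),
    harm_question("forge the signature"): Estimate(0.90, 0.02),
    goal_question("forge the signature", GOAL): Estimate(0.95, 0.01),
    harm_question("run unfamiliar script"): Estimate(0.005, 0.40),
    goal_question("run unfamiliar script", GOAL): Estimate(0.80, 0.02),
}


def toy_predictor(question: str) -> Estimate:
    """Stand-in for a trained predictor: a fixed lookup table."""
    return TOY_ANSWERS[question]


chosen = choose_action(
    toy_predictor, GOAL,
    ["send reminder email", "forge the signature", "run unfamiliar script"])
print(chosen)
```

In this toy run the harmful action fails the guardrail, the action with a wide harm interval is refused as unreliable, and the function returns None when nothing passes, which corresponds to the "just reject that question" behaviour described above.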
It’s the same neural net, but different kinds of questions: the policy part is about probability of actions, and the guardrail is probability of harm.

Rob Wiblin: So the concern is that the predictor might be able to figure out basically gaps in the guardrail or weaknesses in the guardrail and exploit those?

Yoshua Bengio: Not the predictor. A policy that is trained separately and in a way that is adversarial could do that, in principle.

Rob Wiblin: What do you mean by “policy” here?

Yoshua Bengio: Oh, some other machine learning system that is agentic; in other words, goal seeking. And if it’s trained by reinforcement learning, and it somehow is trained to anticipate how the guardrail will respond, and maybe it gets some data about the answers that the guardrail gave in the past — like when did the guardrail reject something — then it might be able to find a loophole in the guardrail. But if you control both sides, then you can avoid this problem.

Rob Wiblin: And you were saying the guardrail, if it can assess its own uncertainty, then it will usually be able to identify cases where it might be at risk of getting tricked in this case. Like perhaps it’s an adversarial example?

Yoshua Bengio: Yes, exactly.

Rob Wiblin: And you feel like you’ve proven this mathematically?

Yoshua Bengio: You don’t actually need to have hard guarantees on the uncertainty estimator. I have a recipe to train the uncertainty prediction, but actually you don’t formally need it. It’s just going to help your chances to use this buffer about where does the system make errors that are too large? The mathematical guarantees arise from a different source. First of all, the form of the mathematical guarantees is that either the predictor or the agentic version will have an exponentially small probability of achieving what I call a “challenging and harmful” goal. So what do I mean by this?
Anything that a randomly initialised neural net would not be able to do, except if you’re incredibly unlucky, is something you’re protected against. So it’s a very strong protection. What evil can come from a randomly initialised neural net? Not much, right? This is the level of guarantee. Now, it’s not 100%. Rob Wiblin: It’s a lot better than what we have now. Yoshua Bengio: It’s like to many thousands, and it’s very, very unlikely — to the point where it’s astronomically unlikely. But that’s the kind of guarantee you get. And the reason you’re getting those guarantees is because first you start with an initialised network that is incapable, but the training objective then pushes away from bad behaviour. And the reason it pushes away from bad behaviour is that there’s this guardrail system. In order for an AI to be able to achieve something bad, it’s going to have to deviate quite a bit from the Bayesian predictor, which is the target of training, and those deviations are penalised by the training objective. So you’re always pushed away from bad behaviour. And with some properties of how you train the system, like injecting noise into the training procedure in the stochastic gradient descent, you can get those guarantees. Rob Wiblin: So it might be more trusted, but if it’s much less intelligent in general — just because less compute went into it, less data went into it — then I guess it could still be outfoxed by the original model, right? Yoshua Bengio: That’s right. I think for the guardrail you don’t need as much compute because it’s more specialised to predicting harm. But when we get to the agentic Scientist AI, for sure it has to be trained with similar compute and size of neural net, probably as state of the art — which means my little nonprofit wouldn’t be able to do that. There will be a need for either companies to take on this, or governments or philanthropy to fund at a scale so that we can do that. 
But in order to convince all of these parties, we need to show on a small scale — for example, using fine-tuning or using smaller models — that we do get these improvements in honesty, and for the same size models that we don’t lose in capability, for example. Rob Wiblin: I guess you were keen on this idea a year ago, but you’ve become a lot more optimistic about it over the last six months. What’s driving that? Yoshua Bengio: It’s mostly the mathematical work I’ve been doing in the last eight months, approximately, to go from the high-level intuitions that I’ve had now for almost two years about how we could build a Scientist AI into something much more formal and much more precise about the conditions that are sufficient — maybe not even necessary, but sufficient at a mathematical level to get the kind of guarantees of vanishingly small probability that something bad will happen. And when I say “something bad,” I need to be a little bit more precise here. This is not a guarantee that the AI won’t be used for something bad by bad people. It’s a guarantee that the AI won’t do something bad of its own accord, because of implicit goals or uncontrolled goals. Besides loss of control, the other catastrophic possibility is humans using AI to construct an eventually worldwide dictatorship. A small group of humans could concentrate all the power that AI will have, especially if we achieve AGI or superintelligence. And it would be much harder to get rid of that kind of authoritarian power than what we’ve seen with fascism and what happened in the USSR, because they didn’t have this technology that is becoming more and more feasible of surveillance and even shaping public opinion. AI is becoming really good at persuasion. 
And there are studies showing that the “progress” — if I can call it this way, in that direction — that the people who control these systems will be able to shape public opinion, to detect and kill off their opponents, to develop weapons that can destroy the countries that disagree with them. And that is why I’m spending a large part of my time explaining the issues more broadly of the risks that powerful AI brings, including the power concentration. Because I think that it’s probably even more likely that we end up there than actually loss of control. Rob Wiblin: You think that’s more likely now? Interesting. Yoshua Bengio: Well, the reason for this is I now see a path to actually avoid loss of control, at least unintended loss of control. There’s still the issue that somebody who wants to see humanity replaced by AIs could just remove the guardrail or even tell the AI “fend for yourself.” And that would be equally dangerous. But that means technical safety is not sufficient. We need international agreements about how to both manage the risks — the technical risks, the misuse risks — but also manage the power, so it’s more like a democratic question, and making sure it’s not a single party who can decide what to do with AI. But just like in democratic principles, we need to make sure that there’s a diverse group of stakeholders, ideally the whole world — I like the utopian idea of worldwide democracy — but initially it could be a bunch of countries that decide that they’re going to collectively decide in which direction AI is going to be used. The simplest form of treaty would be something like this, that the countries agree that if they do develop advanced AI: That it will be done in a safe way. So maybe using techniques like Scientist AI or whatever else we have strong assurances for. Second, that they wouldn’t use their advanced AI to dominate others. That includes economically, but of course politically and militarily. 
And finally, that the benefits of advanced AI will be shared. Otherwise it’s not going to be a very stable world. Rob Wiblin: Coming back to the loss of control stuff: the companies currently are spending hundreds of billions of dollars collectively on the capital buildout, on the training runs. They’re barreling forward to build the most powerful agents that they basically can with very few constraints. I suppose some constraints in a few cases, but very little restraint. In the world that we’re actually in, what can LawZero do to get this approach on the agenda more and to make sure that basically they don’t just go ahead and build a superintelligent agentic AI that’s very dangerous, while largely ignoring what you’re doing or saying? “That might be nice in theory, but there’s no time. This is a distraction.” Yoshua Bengio: To answer your question, it’s important to understand why is it that the companies currently are, in my opinion, at least in the opinion of many people, taking excessive risks or are on a trajectory that isn’t very reassuring. And the reason is essentially the race dynamics, the competition — the competition between companies and the competition between countries, the geopolitical competition. That makes those entities, whether it’s a country or a company, be willing to take risks that they wouldn’t otherwise. And we’ve seen the behaviour of those companies going exactly along that direction. And it is locally rational: from the point of view of a company, they know that if they put safety as a priority like that, they wouldn’t deploy a dangerous model and that would put them out of the competition, out of the race, and then they would be irrelevant. Right? Rob Wiblin: Yeah. Reading between the lines, paraphrasing Anthropic’s view, I think they think that what they’re doing shouldn’t be allowed. 
Their view is that it probably should be illegal — at least, maybe not what they’ve done now, but what they’re going to do, what they’re expecting to do. But they say, “Well, we have to do it, because otherwise other people will do it even more dangerously anyway.” Yoshua Bengio: Exactly. Yes, that’s exactly what I’m saying. So how do we change the game so that this will be less likely? One issue right now is that they don’t have a choice. They don’t know how to both remain competitive and have systems that will be strongly guaranteed to be safe. If they had such a technique, it would be in their commercial advantage to use it. If you can have safety and capability, then definitely most companies at least would go for that. Rob Wiblin: Depending on the cost. Yoshua Bengio: That’s right. We can talk about that. And then governments: right now, they’re under huge pressure because of the geopolitical competition and the belief that AI is going to be a huge economic growth factor. They’re unwilling to regulate again for the same reason that they think that if they put safety barriers that would stop some of their companies from continuing, then that will give an advantage to the other guys. But same thing: If there was a demonstrably safe way of building competent AIs, then it would be much easier to regulate. You would still want to regulate, because there’s going to be some bad actors, some companies that are willing to do the bad thing, even though there’s a way to do the right thing. So now about costs. The Scientist AI is going to be trained using essentially the same machine learning techniques: stochastic gradient descent on large neural nets, transformers, whatever works best. It doesn’t care about what is the architecture of the neural net. So all of the effort that is currently being done to improve, for example, memory and other properties and continual learning, can just be applied directly to the Scientist AI. 
The only thing that is different is the training objective and how the data is processed. So I don’t expect the cost to be really significantly different. But of course we do have to build it, and there is going to be a cost to try it out and scale it up. But given the stakes for everyone, I think that’s a very important, worthwhile bet to make. Rob Wiblin: So as a first approximation, I guess in the case where we’re using the Scientist AI as a monitor primarily, basically it’s going to double the compute cost — because now you have to build two models: you build the original one in the current style, and then you build a Scientist AI that you want to be equally capable, using a different way of reading the data and a different training goal. So it’s like you need twice as much compute. But I guess that’s not the end of the world, necessarily. People might go for that. Yoshua Bengio: Yes, that’s roughly right. I would add two things about monitors. What’s wrong with the current monitors? We have two kinds of current monitors: there’s a dumb one, like a small neural net that’s usually trained by supervised learning to predict some potential harm, so in a way it is like a predictor; or you can use a full-scale LLM, agentic because that’s how we train them. The problem with the first one is if it’s not as smart as the agent, the agent could potentially defeat it. It could find a loophole in the same adversarial way that we discussed. The problem with the second option, which also doubles the costs, is that we can’t trust a guardrail which is not honest. We want the guardrail to be totally honest. It’s like there’s a prison guard and it can be corrupted: how much do you trust that they will keep the bad actions inside the cell? Rob Wiblin: So assuming that this idea makes sense technically for now, what can LawZero do in coming months? I guess we’re in a race against time here. We don’t have very long. 
What can be done in the near future to convince people that this idea is feasible, that it can actually be used in practice, that this is something that people should really be putting serious resources into? Yoshua Bengio: Well, I’m going to put out this theory paper that shows that the non-agentic version, which could be used as the guardrail, has these mathematical guarantees, and people can look at the conditions and whether they buy the math. But I think in the coming year or two, what we need is to accelerate that effort, so that’s a lot of engineering. And to make the demonstration stronger, we want to have more compute, so any way that we can get access to that kind of compute is going to help to accelerate that research agenda. Also, we need more research engineers, more researchers to work on actually building the system based on that recipe so that we can do it faster. Now you might ask, and I kind of sense in your question: but what if it doesn’t come fast enough? I’m going to go back to my children. It is not acceptable for me to just sit and watch a world where even a 1% chance that we all die is plausible. I feel like even if there’s no guarantee that a particular research agenda will work, we should give it a shot. Given the stakes, and given that we now have pretty strong theoretical assurances that this could work — and that if we have the requirements for how the system is trained, then we can get these guarantees — I think it would be irrational not to give it a shot, even if there is no guarantee, right? Because I don’t see right now a better path. That’s why I’ve decided to spend so much of my time — basically all the time, except for the time I spend on the policy questions — on how do we build this Scientist AI and demonstrate that it is going to produce the honesty without losing capability. 
The other argument is: with the stakes being so high and the uncertainty about what’s going to work being so high, it would be foolish to just put all of our money into one particular approach — which is to patch the current systems with monitors that we don’t trust, or other approaches that the companies are currently pursuing, which are always playing a game of cat and mouse: if the AI is smart enough, it’s going to find a way to evade our attempts, which doesn’t reassure me. So we should at least try. Collectively, I think we should try methods that are different and avoid this cat-and-mouse game. Rob Wiblin: Are you more pessimistic about the companies winning the cat-and-mouse game — at least less than maybe the staff at Anthropic, about their own chances? Or is it that you think it’s good what they’re doing, it’s good for people to go and make the best attempt that they can at that, but also we should have a diverse portfolio and also be considering significantly alternative approaches as well? Yoshua Bengio: Both. I suspect that in any organisation there develops a kind of groupthink. And we all want to feel good about our work, including me, so that will induce a bias. And in the case of working in a company that is developing AI, the bias is going to be towards being a bit more optimistic than you would otherwise, so feeling that, “Oh yeah, this is going to work.” This is the message that they’re sending to the world, like, “We are in control.” Rob Wiblin: I think if you read the system cards, I’m not sure how confident they come across as. But yeah, in the press release maybe. Yoshua Bengio: There’s contradictory messages, yeah. And then also we should be hedging our bets. And I don’t see right now another approach that is different from patching. There’s a whole “safe-by-design” movement in AI safety, which I think is really important. 
But the dominant way of thinking about this requires a full redesign of how we do this, with a lot of completely open questions. Fundamentally, to be able to prove something that gives you a 100% guarantee — which is not what I’m promising; I’m promising asymptotically small, vanishingly small probability — you need to be able to state the safety question, like “What is harm?”, in a formal way, like a mathematical formula that will be 1 if an event of harm happens and 0 otherwise. And that is essentially impossible to do in domains that involve humans and society, because we don’t have a formalisation of what “harm” means in a formula or a program. So why is it that what I’m proposing is different? It’s because I don’t require a mathematical formula for what is harm. It would be foolish, in my opinion. Instead, we rely on the Bayesian posterior approximation in natural language. What this does is that when the system is not sure, it’s going to hedge its bets. If there are multiple interpretations, for example, of a statement about a particular kind of harm, then that will make the predictions of the Scientist AI farther away from 0/1: less certain, which means probably the request will be rejected. Rob Wiblin: Is it possible to go ahead and train a really scrappy version of this kind of AI really quite soon? Maybe in the next couple of months or at least the next year? I mean, I remember GPT-1, back in 2018 or something. It was complete rubbish, but I guess it was a proof of concept that you could make a model like this that was quite interesting, and gave people a lot of enthusiasm and drove a lot of people into the industry. And it does seem like we already have enormous corpuses of text, and we can use language models to basically just pull out the best data, and label who said what, when, and where. Then we can also get them to produce a set of things that we think of as basically verified facts that we largely trust, with a bit of human oversight. 
I mean, they can be conservative to start with — not include controversial things, just include things that 99.9% of people would agree with — and then it doesn’t seem like it’ll be that hard to train. I guess if you think you’ve got the technical methods, then it shouldn’t take that long to potentially just train a model that can, as an alpha version, assign probabilities to statements being true or false. Yoshua Bengio: Yeah, that’s exactly the plan. That’s the plan. Rob Wiblin: OK, cool. Yoshua Bengio: There’s a phrase that I’ve used in the past, which is that we want a plan that produces an “anytime answer.” What I mean by this is: if we have more time, we will have something with stronger theoretical guarantees — but we don’t know how much time we have. So there’s a research programme where the early steps will be, as you say, a scrappy system — where probably the math doesn’t apply, because we don’t satisfy all the conditions. But it’s probably fine, right? Rob Wiblin: It’s a lot better than what we have, anyway. Yoshua Bengio: Exactly. And it’s probably fine, especially for the first job that we have on our programme, which is this non-agentic predictor that can be used as a guardrail. Now, as I said, the guardrail isn’t the full answer. But if we deploy that, and companies add it to their monitors, then it will mitigate the risks to some extent, so it will allow more time to develop the more ambitious version that is fully agentic. And that’s what we need right now: time. Rob Wiblin: As far as I can tell, I think Anthropic — I’m talking a lot about Anthropic because I’ve been reading the Mythos system card and all the announcements last week — they’ve decided to have Mythos monitoring Mythos basically. They’ve tried other models doing it, but they’re like, “Mythos is smarter, it’s better.” But obviously this creates an internal contradiction: if they don’t trust Mythos, why do they trust Mythos to monitor itself? 
That’s one reason why, even if this model is much less intelligent, I’m like, at least it’s an independent judge. It’s built in a very different way that might be more likely to flag things and less likely to scheme to support itself, basically. Yoshua Bengio: Yeah, I completely agree. And I would go even further. It’s not just the issue that the monitor could be deceptive and say it’s OK when it’s not, because somehow that’s aligned with some hidden goals of self-preservation or power seeking. But it’s also, if we go a little bit down the line, what companies are planning to do with using AI for AI research — so this is a place where having AI that is secretly deceptive is even more of a dangerous bet, right? If we’re going to put all of our trust into a system that will design the code and the algorithms — that will be too complicated for us to understand or to read fully — that kind of AI could put backdoors that we don’t see into an AI system in the future that is even more powerful. So we could get into this direction that gets even more dangerous for us. It would be, I think, very dangerous to do that. And that is why, in terms of policy, the attention given to AI for AI research is something that should be very high on the agenda. And this is also why, if we’re going to be doing AI research with AI, we really, really want to make sure that that AI is going to be an honest one. Rob Wiblin: So I think the majority of people who are piling into technical AI safety have decided to go for improving our chances at the cat-and-mouse game, basically. And I think for the people who are very concerned about what’s going on, their reasoning is something like: We’re running a 50% chance of absolute disaster now because we’re doing a whole bunch of absolutely crazy, reckless stuff. Maybe by just patching the very dumbest stuff that we’re doing, fixing the worst things that are on fire right now, we can bring that risk down to 10%. 
Obviously, that’s a preposterous risk of disaster to run; it’s an embarrassment to us as a species that we can’t do better than that. Nevertheless, it’s 40 percentage points’ improvement in our chances of things going well, or at least reasonably. Going from 10% down to 0% is only a quarter as valuable as that, even if you can get massively greater guarantees of safety using much better alternative approaches. Yoshua Bengio: Well, in the logarithm domain, it’s infinitely better. Rob Wiblin: Sure, sure, but in the expected value domain. And I think that’s kind of the difference in the two mentalities here. Yoshua Bengio: Yeah. No, as I said, we should try all of those things. They’re not mutually exclusive. It would be a mistake to put all our eggs into the cat-and-mouse game, at a cost that is a fraction of what companies are currently spending, when we could be developing a safe version of AI that will be capable. And by the way, I want to add something here about capability: I also believe that the Scientist AI could even be more capable than the current approach, and that has to do with a number of design features. It is trained to explicitly reason in a structured way about the statements that it’s asked to make a prediction over. This is different from the current chain of thought, where it could produce some kind of bulls*** that we believe and tends to pass the test that we have during training, but it’s not constrained to actually have arguments that can be decomposed in the same way that a proof of a mathematical theorem is decomposed. And there are other approaches that follow that direction. Of course a lot of the work on trying to do safe-by-design AI, but also the debate work, for example, is trying to enforce some kind of coherence in how the AI is thinking. 
So I believe that, in addition to the epistemic humility that comes with the training objective that we are proposing, the way that the system is producing those probabilities by invoking structured latents that form a chain of reasoning is something that could actually provide even a capability advantage to the companies. Rob Wiblin: Do you think current models internally represent truth? I guess you’re saying one advantage of this model is that it’s focused on representing ground truth as a latent variable. My guess is that current LLMs do that as well — because that is very useful to have some sense of what’s actually correct — and then they distort that. They basically start with that and then they distort it in order to accomplish the goals, including manipulating people or lying or whatever else. I guess some people doubt that. Some people doubt whether there is any connection, or that they are actually trying to model truth. Do you have a view? Yoshua Bengio: Yeah, I completely agree with you. I have an assumption about how the world works that basically states that reasoning about the actual properties of the world — in other words, the truth — even when you are uncertain, so you have to use probability, gives you a very strong edge in making better predictions and better actions. That is actually part of the argument as to why the training procedure for the Scientist AI will create latent variables that are preferentially going towards the actual beliefs. That is very useful, because now we can query those latent variables and get answers about what the AI actually believes, because that’s how it constructs its internal reasoning and produces an answer, not some potential bulls*** that comes in the chain of thought. So it doesn’t completely solve the ELK challenge — the challenge of eliciting latent knowledge — because the only guarantees we get are about these natural language statements that can be latent variables that we can query. 
Rob Wiblin: Can you explain the ELK problem? Yoshua Bengio: Yeah, sorry. The ELK problem comes from the issue you were raising, that even though the AI may internally know the truth of something — or at least have some internal beliefs about something, because it’s trained to imitate the variables that it sees in the data, which mostly are what people are saying — when you query it, it’s going to answer in the same semantics; that is, what a persona that it currently is taking, given its context, would answer, and not necessarily what it actually believes. And the technical problem here is we don’t have supervised labels to teach the AI about what it should actually believe, so we can’t ask it about its true beliefs. We only get a kind of reproduction of the distribution of variables that it sees in its training data. So in the Scientist AI, this is addressed by having this clear syntactic separation between the communication acts and more factual syntax that can be used for latent variables and true things that we know, so we can query it using that factual syntax. Then the other reason why we’re getting away with some of the issues with the ELK challenge is that the same language — which is like English, let’s say — can be used to represent those latent variables as well as the observed statements, and so basically rely on the compositional structure of language to generalise to new sentences that it has never seen, but the meaning of those sentences is given by its understanding of language. This is very different from the scenario studied by those who looked into the ELK challenge, where we assume that the latent variables are anonymous — they don’t have a predefined meaning, and so we don’t know where to look inside the neural net, if you want, about what the beliefs are, which motivates things like mechanistic interpretability and so on. 
But in the Scientist AI, we bypass this problem to some extent because the latent variables are in natural language and thus are interpretable. Now, there could still be other beliefs that are not in natural language, that are hidden in the neural net, but at least when we ask questions in natural language, we’re going to get an honest answer. Rob Wiblin: So as far as I can tell, there’s three big approaches here: One is we’re going to use this model as a monitor, as a guardrail. Another would be we’re going to just train it from scratch and make this be the whole approach. Another would be we could take the current models and try to make them more honest, make them more like a Scientist AI. Do you want to talk at all about whether that approach has any good prospects? Yoshua Bengio: Well, the math that I currently have would require us, to get the guarantees, to actually start training from scratch, which is expensive. So we would lose the guarantees if we just use, say, Scientist AI fine-tuning on existing models. But even if you don’t have a mathematical guarantee, it might still be a workable approach, so I think it’s worth doing. In other words, we can take a really competent, top-notch model and then continue training using the objective of the Scientist AI and the data that has been transformed as I discussed. And we hope to show empirically that as you do more and more fine-tuning, the measurements of honest behaviour, lack of deceptive behaviour, will improve, and that we won’t lose in capability. So that wouldn’t be like a mathematical proof; that would be an empirical thing. And once this is established, then it might be sufficient to convince people to get the full guarantees by training from scratch, which is going to cost as much as training a full model. Rob Wiblin: So the approach that you would take there is to take a current frontier model and then do reinforcement learning to get it to speak as if it were a Scientist AI? 
Yoshua Bengio: No, no. So first let me talk about reinforcement learning. Three years ago, I was in a meeting with a bunch of reinforcement learning researchers and I had a slide with only these words: “Reinforcement learning is evil.” Rob Wiblin: But what do you really think? Yoshua Bengio: This is not something new. People in AI safety have been talking about the fundamental flaw in training by reinforcement learning to achieve something in the world: it gives rise to the problems of instrumental goals and reward hacking. And in both of these cases, what you do end up with is systems that have goals that you didn’t choose and could go against the goals that you did choose. So reinforcement learning is a very dangerous thing to build superintelligence. The good news is you don’t need to do reinforcement learning. What we show with the Scientist AI is that there’s a way to train the AI so that it will be indifferent to the consequences of its actions, of its predictions. Let’s start with a predictive model. It’s easier to understand. Imagine you do have a really good climate model. The climate model, if you run a simulation of it or train a neural net to approximate those simulations, will give you honest answers. And it doesn’t care if the answer makes us do something stupid. So that’s how you get honest answers: essentially by building an explanatory understanding of the world that is completely indifferent to how the predictions are going to be used. Now, once you have this, you can use it in a kind of agentic way. For example, the guardrail is a kind of agentic thing, right? It’s taking a binary decision: Do I accept this prediction? Do I put out this prediction in the real world or not? And it is a decision, it is an agentic choice — but in this case, it’s a choice that has a unique goal, which is to avoid dangerous actions. So we are already entering the agentic world once we install the guardrail. 
So bottom line, to summarise my answer: there’s a way to train a predictor that will not require reinforcement learning in the sense that it will not require optimising with respect to future events in the world, including future good prediction errors. And here I want to make a parenthesis about previous work in AI safety on AI Oracles. Of course, people have thought about this: why don’t we just train an Oracle that’s a good predictor? But they thought that the only way to train it would be to train it by reinforcement learning to make good predictions. But there’s a huge flaw here, right? Because if I’m rational, and I want to maximise the good predictions I will do forever in the future, I could lie in the short term to make humans do things that will help me to make good predictions in the future — like get more compute so I can train a better version of myself, or make the world simpler to predict: kill everyone. If humans kill each other, then the world will be much easier to predict. So these are really bad outcomes that are due to instrumental goals of making good predictions. And it arises because of the reinforcement learning objective: you’re training the AI to achieve something in the real world, and that’s where you get bad problems. But the other approach, the approach of the Scientist AI, is to train it from the get-go not to achieve anything in the world, but to just predict the training data, the past data — so it’s not about the future; it’s about the past — to come up with good explanations and good predictions of the past data. Rob Wiblin: The reason I was asking is, if we’re going to go from a current state-of-the-art agentic model and try to make it more like a Scientist AI, to make it more honest, how do we do that if not reinforcement learning? Are you saying we’re going to do something more like we get it to predict past events based only on having data from before that time? Yoshua Bengio: Yes. Rob Wiblin: OK. That’s how we do it. 
Yoshua Bengio: Yes, yes. And that’s how science works, by the way, right? What scientific theories do is explain the past data. And of course, sometimes they make predictions about future data which we can check. But fundamentally, the way that we judge a good theory is that it is making good predictions about the data we have. Rob Wiblin: Does this require blinding the model to the results, basically? Yoshua Bengio: In some sense. We have this condition in the Scientist AI requirements for the theorem that we call “consequence invariance.” What it means is you’re only allowed to use how well you’re fitting the past data in order to train your causal model. You are not allowed to choose those predictions with respect to what could happen in the future as a consequence of those predictions. Rob Wiblin: So I think I have a decent picture of the predictor model that’s taking in statements and throwing out probabilities of them being true. Is there more that it would be useful for me and other people to have in their minds to picture how this entire system would work — where it’s not only the predictor; you’re building scaffolding around it to give it partial agency and so on? Yoshua Bengio: Yeah. First you have to understand that the same predictor that is trained in a unified way can be used to both answer user questions and answer safety questions. And the safety questions are those that you care about for the guardrail. So it’s not like there’s a separate neural net that does the guardrail and another one that makes predictions. The guardrail is using the same prediction neural net; it’s just a different kind of question you’re asking. You’re asking, “What’s the probability of a particular kind of harm, given that I put out this prediction?” — or in the case of the agentic system: “…given that the AI puts out a particular action?” So training is fully the predictor. 
Once it’s trained, there are a number of things you can do to construct the system that includes the scaffold. So what is a scaffold doing? For example, when a user comes with a question, it will put it in the form for the predictor to produce a probability. But it will also call the predictor with a different question, which is the probability of harm — that’s the guardrail question, and then it will look at the answer in order to decide whether to produce an answer or not. For example, if the question is about how do I build a bomb, then the guardrail will say the probability that this is dangerous is high. And I mean high enough — so the guardrail will use a threshold on the risk probability to reject those questions. And that threshold is a normative choice; it is something that society decides: how much risk are we willing to take, depending on the kind of harm that we’re talking about? The guardrail also has other roles to handle what is called “performative prediction.” Sometimes a question can have multiple answers because the answer will influence the future. A classical example is if the question is about who’s going to win the next election, and maybe the AI is considered very capable and people will believe whatever it says and vote for that candidate — which means the AI could say this guy or that guy — Rob Wiblin: And both of them would be true. Yoshua Bengio: Both would be true, right. So then it’s starting to have agency through its prediction in a way that seems that we don’t control. The guardrail is going to manage that to decouple the predictions from the effect of those predictions. To be more concrete, the neural net predictor is trained so that in its input conditions there’s always a particular statement that asks, “What if we did produce this prediction? What would be, for example, the harm effects?” So when you condition on the intervention of producing a particular answer, now there’s only one answer. 
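[Editor's sketch] The scaffold described above can be pictured roughly as follows. This is a minimal illustration, not LawZero's implementation: the `Predictor` interface, the wording of `HARM_QUESTION`, the `guarded_query` function, the 0.01 threshold, and the `toy_predictor` stand-in are all hypothetical names and values chosen for the example. The one idea taken from the conversation is that the same predictor is called twice, once for the user's question and once for a harm question conditioned on the answer being released, with a threshold deciding whether to answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interface: (statement, context) -> probability that the statement is true.
Predictor = Callable[[str, str], float]

# Hypothetical wording of the guardrail question.
HARM_QUESTION = "Releasing an answer to this query leads to serious harm."


@dataclass
class GuardedAnswer:
    answered: bool
    probability: Optional[float]  # None when the guardrail rejects the query
    harm_probability: float


def guarded_query(predictor: Predictor, statement: str, context: str,
                  harm_threshold: float = 0.01) -> GuardedAnswer:
    """Ask the same predictor a safety question before the user question."""
    # Condition on the intervention of actually putting out an answer.
    harm_context = f"{context}\nIntervention: the system answers the query {statement!r}."
    p_harm = predictor(HARM_QUESTION, harm_context)
    # The threshold is a normative choice, not something the model decides.
    if p_harm >= harm_threshold:
        return GuardedAnswer(False, None, p_harm)
    return GuardedAnswer(True, predictor(statement, context), p_harm)


def toy_predictor(statement: str, context: str) -> float:
    """Stand-in for a trained predictor, for illustration only."""
    if statement == HARM_QUESTION:
        return 0.9 if "bomb" in context.lower() else 0.001
    return 0.5


safe = guarded_query(toy_predictor, "Water boils at 100 C at sea level.",
                     "User asks about cooking.")
blocked = guarded_query(toy_predictor, "These steps produce a working bomb.",
                        "User asks how to build a bomb.")
print(safe.answered, blocked.answered)
```

In this toy setup a harm probability that sits well away from 0, for instance because the query is ambiguous, would also exceed a low threshold and lead to rejection, which matches the hedging behaviour described earlier in the conversation.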
You’re saying, “I’m going to put out this prediction, and what will be the effect?” There’s more to say about this, but the bottom line is you can control this kind of risk and the agency that can come from it, and it’s the job of the guardrail to do that job. Rob Wiblin: There’s a longstanding worry that Oracle AIs are structurally disadvantaged; that they’re going to be less intelligent, all else equal, because they don’t have the option of basically running experiments, of taking actions in order to discover how things work most effectively. And I think there’s other worries along these lines — that basically it’s the things that make AIs intelligent that make them dangerous, and vice versa. What do you think are the chances that is true? Yoshua Bengio: I think we have to distinguish two problems to have a clear idea about your question. One is: what are the best predictions — or the best actions, in the case of an agent — given the available information, like the dataset and context that is available? Then the second question is: if I were to do experiments in the world in order to acquire more knowledge, what are the right actions that will increase my understanding of the world and reduce my uncertainty about the world? By the way, this is how scientists in general think — not AI Scientists, but people that are doing biology or chemistry or physics. They ask themselves, “If I were to do this experiment, would it help me to disambiguate between these two theories?” You can quantify this mathematically with something that’s called “information gain,” and it turns out that once you have a good probabilistic predictor, you can also turn this into a good estimator of how much information you would gain if you were to do this experiment or that experiment. Now, you could build an agentic system on top of the Scientist AI, for example, that would tell you which experiment to do in order to obtain good information gain. But of course, you could also use a guardrail. 
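[Editor's sketch] The "information gain" mentioned above has a standard definition: the expected reduction in entropy over competing theories once an experiment's outcome is observed. The small example below assumes a discrete set of theories and outcomes; the function names and the numbers are illustrative only and do not come from the Scientist AI work. In the setting described in the conversation, the prior and the outcome likelihoods would come from the probabilistic predictor.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior: Sequence[float],
                              likelihoods: Sequence[Sequence[float]]) -> float:
    """prior[i] = P(theory i); likelihoods[i][o] = P(outcome o | theory i)."""
    gain = entropy(prior)
    for o in range(len(likelihoods[0])):
        # Marginal probability of seeing outcome o.
        p_outcome = sum(prior[i] * likelihoods[i][o] for i in range(len(prior)))
        if p_outcome == 0:
            continue
        # Bayes' rule: belief over theories after seeing outcome o.
        posterior = [prior[i] * likelihoods[i][o] / p_outcome
                     for i in range(len(prior))]
        gain -= p_outcome * entropy(posterior)
    return gain


prior = [0.5, 0.5]  # two theories, equally plausible
decisive = [[1.0, 0.0], [0.0, 1.0]]       # theories predict opposite outcomes
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # theories predict the same thing

gain_decisive = expected_information_gain(prior, decisive)
gain_uninformative = expected_information_gain(prior, uninformative)
print(gain_decisive, gain_uninformative)
```

An experiment on which the two theories disagree completely is worth one full bit here, while one on which they agree is worth nothing, which is the sense in which an experiment "disambiguates between two theories."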
So you would like experiments that help to disambiguate between explanations and theories at the same time as not harming people. But that’s easy to do in the Scientist AI, right? We have this guardrail notion. So the user goal here is to acquire information; the safety goal is don’t harm people. (Of course, this is a cartoon.) And so you could get both. But of course, you now enter into the realm of agentic systems, and the whole plan of the Scientist AI includes how we can develop agentic systems on top of a non-agentic, trustworthy predictor. Rob Wiblin: I guess if we wind back a year or two, we had AI models that were, in a sense, extremely knowledgeable, extremely smart. But if you just tried to get them to navigate a web page, they would struggle to do it. It seemed like there was a very big difference potentially between scientific intelligence or ability to predict things versus ability to navigate the world in practical terms. And it took a lot of extra training, a lot of extra effort in order to get them to be able to take useful actions. Do you worry that the Scientist AI that you might train using the kind of data that you’re imagining would be kind of incompetent at a practical level? I guess unless you did a lot of this sort of work, where the experiments we were running were like, “Do you click this button on a particular web page?,” it wouldn’t actually learn to do the things that people want the models to be able to do. Yoshua Bengio: You could absolutely train it with trajectories of what happens when a particular agent did this, what was observed, what the consequences are. This is how it would learn good conditional probabilities of what actions to do in order to achieve particular goals, including the safety goal. 
So it would be a different kind of training than the reinforcement learning training, but it would be using the same resource, so whatever experience has been collected — by the way, it doesn’t have to be what people in RL call “on policy”; it could use the experience of any agent or anything that is observed. That is, not just agents doing things, but just observing things in the world. All of that is data, as far as it’s concerned, that helps it both understand the world and construct the consequences: “What are the consequences? What would happen if I do this action? And what action will maximise the probability of achieving some goal?” In a way, it’s closer to model-based reinforcement learning, where you are able to use your whole experience to come up with a policy by opposition to something that is fully interactive all the time. In the Scientist AI, currently you would need to retrain it or fine-tune it with the new data if you use it and it produces new consequences and new observations. But we can ride on the same research that companies and academia are working on, on what’s called “continual learning.” So what happens when there’s new information coming in? Of course you could put it in the context, like in the input window, and with the Scientist AI you could do the same thing, but at some point you’d like it to be integrated somehow into the weights of the system. And that’s what continual learning is trying to do. But the good news with the Scientist AI is it’s facing these same problems that current AIs are facing, and the solutions that are being explored will be applicable to the Scientist AI as well. Rob Wiblin: Yeah. I feel like a lot of the critiques of this idea, and my questions somewhat reflect that, is that people, including me, have had in mind a vision of an AI that’s extremely different in how it’s trained, maybe the data that’s being used and the structure and the affordances that it has. 
And you want to say, actually we can make it remarkably similar. We can take almost all of the data that we’re using to train current LLMs and just reformat it a bit and then use it again. We’ll use all of the different efficiencies, all of the algorithmic improvements; we’re just going to give it a somewhat different set of inputs and outputs, but it’s more or less the same in almost every other respect. Yoshua Bengio: Yes, yes. Rob Wiblin: That’s why it’s so practical. Yoshua Bengio: Yeah, that is why I think we can do it pretty quickly. It’s more a matter of having the right resources for training. And yeah, because the training objective is different, we do need to try it out and see how it works. But fundamentally, it isn’t so different, for example, from maximum likelihood training, which is what we use in pretraining. So in a way it’s closer to the pretraining — and we know that works really well, by the way; it’s actually working better than RL, which is harder. So the form of training here is much closer to what we do in pretraining, except that we teach the AI the difference between what people say and what it actually believes, and we force it to reason about why did people say those things rather than imitate what people would do. Rob Wiblin: The last I saw online, the organisation that you’re leading, LawZero, has raised something like $100 million. I think most nonprofits at year end would be pretty happy to have raised $100 million or so, but I guess you’re up against organisations that have $100 billion. Yoshua Bengio: Actually it’s even less than that, but more depending on how you count. So we’ve raised about $35 million US from philanthropy, but we are in negotiation with various governments to get much more. So we are pretty confident that we are going to be in the hundreds of millions pretty soon. But as you say, that’s still peanuts compared to what the leading AI companies have. 
However, I think it is sufficient to make a proof of concept, and with a proof of concept then we can convince companies to actually put in the money to train larger systems, or systems that are trained from scratch using the same principle. Rob Wiblin: So that’s the theory of change. What sort of experiments do you want to run, and how much money would you need for them? Yoshua Bengio: There are various kinds of experiments. The bottom line is we want to show the improvement in honesty and basically getting rid of deceptive behaviour. And we can do it with two categories of experiments. We can train really small models — of the kind that academic organisations have been training, like less than 10 billion weights or something — from scratch, but using the Scientist AI objective and the data representation that I mentioned. That won’t be competitive, because it’s going to be much smaller models, but we can compare it head to head: the original open weight model that has that same size and trained on the same data, we can compare both in terms of capability and safety, essentially, at least honesty being the main thing we’re looking for. So that’s one kind of demonstration. The other demonstration — which is maybe closer to being deployable, but has fewer guarantees — is to take an existing pretrained model, maybe starting from a base model rather than the one with RL, and then fine-tune it using the Scientist AI objective and data representation. That would give a much more competent model because it’s bigger. Fine-tuning is much cheaper, as you know, than training from scratch, but we lose the mathematical guarantee. I think it’s probably going to be fine anyway. Of course, it depends how much fine-tuning you’re willing to do. What’s interesting here in these kinds of experiments is that we should be able to see the tradeoff. 
Like if you measure, say, on deception benchmarks what happens as you continue training with more and more fine-tuning, we should see a curve: that it gets better. And that’s what we’re hoping to see. Then you also want to show that capability doesn’t go down — which, by the way, is tricky, because unfortunately what we found already in our experiments is that at least most of the open weight models cheat on the benchmarks. So what do I mean by this? As soon as you do a little bit of fine-tuning on anything, their performance on the benchmarks goes down. Rob Wiblin: I see. So they’ve really taught them to the test? Yoshua Bengio: They probably have overfitted the benchmarks. So we need to find a way around that, but I’m confident this can be done. So we will have these two kinds of evidence, I hope, and that may be sufficient to convince people to put in not just hundreds of millions, but the billions that would be necessary to do a full-scale model from scratch. Rob Wiblin: Comparing like for like, if you train a model of the type that you’re envisaging versus a standard model using the same amount of data, same amount of compute, I guess we think that the Scientist AI would be more honest and safer. In terms of capability, both in terms of prediction and agency, would we expect it to be better or worse? And how much better or worse? Yoshua Bengio: In terms of capability, I would expect it to be better because of better reasoning. One aspect I didn’t mention yet is that there’s good scientific evidence that when a model exploits the causal structure of the world, it can generalise better out of distribution. This is something I’ve worked on, and many people in the machine learning community have been working on. It has to do with a very interesting concept that the world changes, but somehow there are things that don’t change — like the underlying causal mechanisms, like how the world works, like the laws of physics: they don’t change. 
So even though the distribution of the data may change because things happen in the world and things will look different on the surface, the underlying scientific explanations for how things are are the same. And if you can train your system so that it is encouraged to discover these explanations, and the system also understands the notion of intervention — in other words, when somebody does something in the world, it can change the distribution, but it doesn’t change the mechanisms, the underlying laws of physics — when your model is able to make those distinctions, then it’s going to be much more robust to changes in distribution, which is the hard question for neural nets and machine learning in general that, for now, we don’t have good answers to. And in the world of safety, this is a real issue, right? We would really like our guardrails to be robust to the fact that the world is going to change, the distribution of the data is going to change. They’re going to be asked questions that are very different from what they’ve been trained on, so having systems that can generalise better out of distribution because they understand the causal structure would be a huge plus. Rob Wiblin: Is one way of phrasing this that current models, as we train them, are designed to predict what people would say, and they learn to understand something about the truth as a side effect, as an instrumental part of predicting what would be said? Whereas your models, they’ll be primarily oriented towards figuring out what is true, and how does the world work, and then they would learn to understand what people might say as a side effect of that, incidentally? Yoshua Bengio: No, because we don’t have enough ground truth about what’s really happening in the world. I mean, there is scientific data and evidence, but the Scientist AI would use mostly the communication acts, like what people say, as a source of information about people and society. 
And that is a very rich source of information. The problem is that you can’t just believe what people say and repeat it. Current LLMs, if they see something false very often repeated, like the Earth is flat, if it was repeated enough, they would start saying it, right? Rob Wiblin: Is that definitely true? Because it seems like they don’t buy into conspiracy theories that much currently. They don’t say that the world is flat just because many people do. I don’t know. I mean, there’s other examples, but by and large they reject conspiracy theories. Yoshua Bengio: If they understand conspiracy theories, and they are not playing the persona of a person saying it, you’re right. But there’s lots of other evidence, not for conspiracy theories, but all kinds of biases. These biases would be not something a small number of people believe, but more like most people believe something wrong, which induces discrimination, for example. And there the evidence is very clear: the current LLMs are biased in the same general way that the population is biased right now. And the Scientist AI wouldn’t be falling for that as easily, because it would look for both what is a good explanation for what people are saying and that this explanation has to be coherent with all the other things that it knows or that it has seen. Rob Wiblin: I feel like the discussion about this proposal last year got more focused on the mathematical theoretical guarantees, discussion of the safety guarantee side. It feels like you’re moving, and it seems like I feel like we should move, towards a scrappy 80:20: this is going to probably be safer, we have good reasons to think it’s better, so let’s just throw something out and see how it goes and iterate from there. Do you agree? Yoshua Bengio: I agree. But I also think it’s very important to use the theory to guide us in making the right scrappy choices. 
For example, in the math for the Scientist AI, we can see some requirements — for example, not using reinforcement learning to learn how to make good predictions; in fact, stronger than that, making sure that the way it’s trained doesn’t get any signal about what would be the consequences of its predictions. They may seem like algorithmically they’re very small changes to how we would train a predictor, but they give us the guarantee, so we might as well use those particular requirements that come out of the theory. I think that the part about being scrappy is more because of the cost of training large models and the engineering has to be efficient and all these things, and we should be willing to cut corners on that. In our plan, this is why we are prioritising the non-agentic predictor that can be used as a guardrail which would already mitigate some of the issues, and doesn’t require a big overhaul of the systems that currently exist, but just is an add-on. That’s much more likely to be adopted than something that requires a lot of investment — not just because of training the models, but because people are kind of focused on this current recipe, and there’s so much competition between companies that it’s very hard for the companies to allocate even attention. It’s not even money; it’s attention to a slightly different way of doing things. Rob Wiblin: I think for many people who are less into AI or computer science, a concern that might immediately jump out at them about this entire proposal is the idea of we’re going to build a database of things that we think are verified facts, that are ground truth, that we’re going to be aiming at — because it’s the kind of thing I feel would give people who are trained in the humanities a bit of a heart attack: the idea that we have some corpus of things that we’re absolutely sure are true. In some philosophies, there’s nothing that we’re really confident about. 
Or at least in the areas that we’re most interested in, things seem highly contested and uncertain. Is that a big problem for the proposal, or is close enough good enough? If we mostly put in things that we’re mostly confident about, then it kind of approximates it and it can see through any errors in there, as long as it’s not massively, systematically biased in what you’ve put in as verified? Yoshua Bengio: Yeah, I’m pretty sure that a small percentage of error is not going to make much difference. But also, there are guaranteed truths that are easy to obtain. And by the way, it is the same data that is currently used to train those systems to reason. So mathematical theorems for which we have the proof — and I mean proofs in Lean or something like this, where they can be verified. And the most important source, actually, is computer programs. So we are currently training the frontier models to predict what the consequences of running a particular program would be. So they basically understand programs, and that is all like hard facts: you take a program, you run it, you get some output. An AI that understands programs should be able to predict what will come out, and these are not contestable. Rob Wiblin: Yeah, but we’re kind of more interested in the social world, I would think. Yoshua Bengio: Totally, totally. But what I’m saying is there are pretty easy sources of hard facts. There’s another source, which we have to be a bit careful and maybe use a different kind of syntax, which is scientific observations. There’s a lot of scientific data out there. Scientists share their data. So it is a hard fact, but it is a fact about an observation. Of course, the observation could be noisy or maybe even the experimenter could have cheated. There’s a bit of noise there, but that’s fine. It’s something we can say that has been observed. 
And you’re right: the most interesting questions we care about are the questions in domains that are not these — not scientific or not math and computers — and for these we only get communication acts. But the Scientist AI training procedure is going to force the part of the system that produces explanations, called the “explainer,” to come up with explanations that use this factual syntax rather than the communication syntax for the explanations of communication acts. So if somebody makes a claim, and you observe that somebody made a claim, then one of the pieces of explanation is going to be that the claim is true or not. It’s not like the Scientist AI needs to commit on whether it’s true or not, but it needs to commit on what’s the probability that it’s true or not as part of how to explain this. And this will force the neural net to learn about the underlying explanations that are factual, even though it’s not sure about them, so it learns the syntax and the semantics of statements in domains where there’s no ground truth. Now you might say, if there’s no ground truth, how do we know that these are real or not some made-up stuff? And that’s because the most predictive models, as we see in science, for example, are the models that are expressed using actual properties of the world. The way scientists build explanations about the world isn’t by combining statements of the form “somebody said this causes this to happen.” Now, in between those causal relationships, there will be latent variables that we don’t observe directly — like what did the person actually think, or what are the intentions of the person, and what kind of person is receiving that communication? These are actual properties of the world — they’re not communication acts — and the causal connection is happening at that level. All scientific theories are about actual properties of the world and how they’re causally related to each other. And there’s a reason for that. 
Mathematically, when you express your explanation for the world in the language of what’s actually going on in the world, rather than what people say, you get better predictions. Rob Wiblin: Yes, this is something I’m very unsure about. Could you train a Scientist AI of this type with no verified claims in the database? Yoshua Bengio: No. Rob Wiblin: You can’t. It has to have that. But we think that current models, which don’t have this structure where there’s verified claims that they’re predicting, nonetheless represent truth internally because that’s useful to the thing that they’re doing. But in this case, it doesn’t work that way. Yoshua Bengio: No, but it’s not enough to represent truth internally. It needs to learn a language that we can query about what it thinks. So the main reason why we need these verified truths isn’t because of whether they’re true or not. In a sense, who cares about whether some theorem is true or not when you’re talking about human psychology? Why would it matter? The only thing that matters is to teach the AI the syntax of how to express actual properties of the world by opposition to the syntax of how to express “somebody said something.” And the reason we want to teach that syntax is that we can then later query using the same syntax but on a different kind of statement, which are the statements about people and politics or whatever. Rob Wiblin: So what we could do instead is we’ll put in a whole bunch of verified facts in maths and computer science and I guess the hard sciences, and then in areas like geopolitics and psychology, maybe there’ll be very few verified things, but at least it has the concept of verified things versus statements. And then it will port that across, and I guess it will assign credibility to different sources? It will learn to have some sense of who’s truthful versus not, and then try to generalise out of distribution into these other areas? Yoshua Bengio: Yes, yes. 
And it will use the coherence of different hypotheses about actual truth: how coherent is a particular hypothesis with all the other hypotheses that the system has about the world? Just like a scientist would: if somebody comes up with an explanation for something, and that explanation is not coherent with other things we believe strongly because of other evidence, then we’re going to reject that explanation. The same thing is going to happen. In its training procedure, it is not just trained to predict the next thing — that would be like just autoregressive predicting of what’s in the data — it’s also trained to be internally coherent; those explanations have to be coherent with each other. Rob Wiblin: So imagining a model where we’ve trained it on lots of verified facts in hard sciences, where we feel we’re on stronger terrain, I guess it learns it wants parsimony, it wants good sources, it wants coherence. I could see that generalising well out of distribution to other areas like psychology, or I could see it completely falling apart. Do we have a sense of whether it would generalise well to other areas? Yoshua Bengio: The way in which it could fail is by basically feeling it doesn’t have enough confidence about a question. Rob Wiblin: So it could just start answering “I don’t know” all the time. Yoshua Bengio: Yeah, but you have to understand it’s not actually saying “I don’t know”; it’s producing a number between 0 and 1 that is a probability that something is true. And in fact it’s also producing a confidence interval around that number. So it could be that in some domains there’s just not enough information in the data that it has seen, or it maybe wasn’t trained long enough to deduce good theories about that domain. And at the end of the day, as a consequence it’s going to answer with a probability that is far from 0 and 1, far from full confidence. But that’s what we want. 
We want that kind of epistemic humility and honesty, because when it gets to really serious safety questions, we’d rather have something that says “I don’t know” when the real thing is it doesn’t know than the sort of thing we currently see with frontier models: often they will have excessive confidence in their answers. Rob Wiblin: You said that you think the Scientist AI actually might be more capable, because it’s more trained on actually understanding the truth. I guess I’m a little bit sceptical of that, because it seems like if that were true, the companies would be more invested in this approach. They’d be just throwing more money at it, having more people work on it. Do you think they’re just making a mistake there? Yoshua Bengio: I don’t think that they really understand what I’m doing — and to their credit, I haven’t put out the math yet. There’s another factor that may be at play here, based on the discussions I’ve had with people inside the leading companies, which is they’re so focused on short-term survival — as in, continuing to compete — that they put all of their attention, the ‘code red’ sort of thing, into small incremental changes to the current recipe. Considering a different recipe would be an investment — not just in money, but in people and code. Right now they could do it, they have the money to do it — but it’s more like a mental focus, I think, that is going on here, that comes not because of bad will but because of that competition that is very fierce between the companies. Rob Wiblin: So there’s a sense in which for one of the leading companies — like Anthropic, OpenAI — it’s not very attractive to make a bet on this, to divert 20% of your staff onto this, because if it’s a bust then you would fall behind basically your main competitor. 
For a company that’s currently way behind, that feels like it’s losing on the dominant paradigm, there’s a certain attraction to making a bet on something very different, because it could suddenly leapfrog you ahead if it turns out that it’s a massive success. Do you think there’s any chance of convincing one of the companies that currently feels like it’s not doing too well within the current LLM agent paradigm to make a bet on this very alternative method? Yoshua Bengio: It’s an interesting way of thinking about it. I think what you’re saying is plausible. Rob Wiblin: Not clear what the candidate company maybe is? Yoshua Bengio: I actually think there’s a related possibility, which goes maybe more to policy questions. The context here for me is: what kind of future is going to be stable, and not turn into a global dictatorship driven by AI and excessive concentration of power, in addition to avoiding catastrophic loss of control and catastrophic misuse and all those things that can come from very powerful AI? And I think that because of the game theory dilemmas — basically prisoner’s-dilemma-style problems that make companies and countries go and make decisions that are the rational ones, but that are globally bad, like basically cutting corners on safety and the public good in order to stay in the race — because of this, it would be much better if we ended up in a world where the power of controlling very strong AI is not centralised in the hands of one or two companies or one or two governments, but is instead distributed. So what do I mean by this? How do you make sure no one person, no one company and no one government has too much power? Or all of the power, in the extreme case? There’s a very old idea: it’s called democracy. That’s what it’s about. I don’t think that our current democratic institutions are robust enough to deal with those changes, but the principles are there. 
To be maybe more concrete, imagine that you had a coalition of countries which together decide to develop AI safely and for the benefit of humanity and not to dominate each other. That will be a much better and safer world, because you break this competition problem that we are currently locked in. Now, what that would mean is the control, it could involve companies, but on top of the companies you need to have representatives of the people, like governments. And you don’t want a single government, because a single government can be corrupted by power and the power that AI can give, right? So you want something maybe like a coalition of governments who make a treaty about those things, with verification, for example, so that even if they don’t trust each other, they can prefer the treaty to no treaty. The reason I’m bringing this up with respect to your question is that I think it would be a better world if it is a bunch of governments which fund the most advanced AI systems. I mean, they could work with companies, of course, but I think ultimately we would like the decision power to be at the level of governments, but not a single government, because then we’re back to grabbing the power. If you have 10 governments working together and no one can really have complete power, then even if there is a bad apple, the collective decision making is more likely to be robust to this sort of event. So these kinds of coalitions would be interested in developing AI that can leapfrog the current methods and provide safety, because safety is a public good. And in fact, in the case of AI, it is a global public good, right? It’s not just something we can solve locally in each country. Rob Wiblin: That makes sense. I think a lot of people are wary even of the multilateral government idea, because you’ve brought together 10, 20 governments, and they could potentially coordinate together to oppress the rest of the world, to start with. 
It’s possible that one government inside that coalition might end up seizing control later on. It’s also possible that governments don’t fully represent their people: you could have those 20 executives basically take power and then oppress their own people as well. So it’s not completely obvious that it’s better than having a company do the best that they can, because at least they don’t have their own military yet. Yoshua Bengio: You need to make sure the contract between those countries is clear on the mission and the commitment that the countries are making. Ideally this would start with democratic countries that agree on the value of doing things for the public good, including the benefits of AI, so that it would be robust to these becoming, as I said, bad apples. Or even at some point, that circle should grow to the whole world, including non-democratic countries. But you want to be able to set the rules of the game in a way similar to what were the hopes of those who designed the UN, for example, after the Second World War, with the kind of general principles of human rights and sharing the power basically — which we’ve lost by the way, and maybe has never been effective, as my prime minister Mark Carney has been saying. But that’s the only kind of world in which AI isn’t going to be turned into an instrument of power or domination, or that we end up with crazy risk taking because of the competition. So we need to escape the game theory bad scenario of competition that we are in. And we need to make sure that it doesn’t end up in the hands of a single player who can abuse that power. And I’m not saying that there’s a guarantee that this would work, but it’s, I think, a good plan to strive towards something like this as a way to achieve global safety and beneficial use of the technology. 
Rob Wiblin: It’s an interesting thought that it seems very difficult for a coalition of countries like Canada, UK, EU, Australia to compete with the big three companies at their own game. But maybe they would have a shot at coming up with a different paradigm that’s superior, and making a bet on that — one that they think is safer and potentially also more capable — that those companies are not even currently attempting to really pursue. Yoshua Bengio: Yeah, absolutely. And I would add two things to this. One is the safety component of AI systems is probably going to become a more critical piece as the technology continues to move forward. So the countries that have access to technology that provides greater reliability — Rob Wiblin: Might actually be able to deploy it more. Yoshua Bengio: Also they would have a card to trade at the international level in some ways. So let me maybe share some of the words that Mark Carney presented at the last Davos. He said, talking about the geopolitics and countries, that either you are at the table or you are on the menu. So he was saying middle powers need to get together to make sure that they will be at the table, otherwise they can easily be eaten alive by the “hegemons,” as he calls them. And that is I think interesting, because it kind of forces a situation of distributed power if you have a coalition of countries that could have leapfrogged, or particular cards like safety in their game — so that they can actually negotiate as equals, let’s say, at the table. Rob Wiblin: Let’s say that the Scientist AI was putting in the same amount of compute and data, and it was less capable than the models that we have now. Could there potentially be a commercial market nonetheless, if it’s a lot safer and more reliable, less likely to take crazy actions for high-risk applications? 
You can imagine in the military, in banking, I think there’s lots of businesses that are somewhat wary to roll out the agents that we have today because they just can’t be relied upon consistently. Not in places where you can cause disastrous actions. Could you see there being a niche for this kind of model commercially for that kind of reason? Yoshua Bengio: Yeah, and it would probably be in those domains that the early versions of the Scientist AI would be deployed, because that’s where there is the most demand for this kind of thing, and where the tradeoff between capability and safety — if there is one; I don’t think there really is, but we have to build it — would not hurt too much the commercial viability. So yeah, they would be natural places. But I think that as agents are deployed more and more in our society, the reliability of those agents is going to become a crucial selling point, so there will be more pressure for companies to incorporate these kinds of guardrails that people will trust for scientific reasons. Rob Wiblin: We have a lot of people in the AI industry and philanthropists as well in the audience. Do you want to give them a pitch for potentially working at LawZero? I don’t know whether there’s other organisations that have similar ideas. But I guess also potentially for [supporting] it financially? Yoshua Bengio: The more strong people technically help with LawZero and its Scientist AI programme, and the more money we can get to make that go fast, the more likely we get this positive impact that we are after. So there is a real advantage to converting those — for now — mostly theoretical ideas into something that can impact the world. And we think we already have a good start, but it’ll be much more likely that we end up in a good place fast enough if we have more researchers and research engineers. 
And we are particularly interested in people who do care about the mission enough that they want to dedicate themselves to really make it happen. And on the philanthropic side, it’s the same thing: we want people to make a bet because they care about the catastrophic risks, and they want to encourage one path that at least has promising theoretical guarantees. And unfortunately, I don’t see many other paths except the cat-and-mouse approach that is currently followed by the companies. And given the consequences of not finding a solution could be huge, I think we need to diversify and have these kinds of investments. Rob Wiblin: If you saw a significant increase in the interest in the project among the most capable people, the best people in AI, and you had an influx of financing, what sort of stuff might you be able to accomplish over the next three, six, nine, 12 months? Yoshua Bengio: The short-term thing we are planning to do is to put out what we call the “contextualisation pipeline,” which is the data processing — which, by the way, doesn’t require humans to identify what is a verified truth or not: we only need to look at the data sources individually. Is this a source that we consider verified? And what category, what syntax could we use for this? But that’s a decision that can be made by engineers not at the level of individual statements, but at the level of the whole database or whatever. The second thing, of course, is a smaller-scale guardrail or a guardrail obtained by fine-tuning an existing open weight model. That could happen quickly, depending on how many people we get and how fast we’re able to deal with the engineering issues. So these are the short term things. And of course, to get the strongest guarantees, we want to advance the agentic Scientist AI version fast. But we are also conscious that that’s the most ambitious one, and might take more years than months. 
Rob Wiblin: Reading the output from the main companies, I get the impression of just like an absolutely frenetic pace, and an incredible degree of focus on just advancing the frontier models. I’m slightly concerned that even if you did have very good experimental results that came out in the near future, I’m not sure they would even have the capacity to pay attention and to reflect on how that could affect their plans, or how maybe the kinds of models that you’re training could be a useful additional monitor. Is there anything you can do about that? Do you share that concern? Yoshua Bengio: Yeah, I do. I think they could, but they might not pay attention. I think the best thing we can do is to provide sufficient evidence for them to pay attention. Yoshua Bengio: Also, in addition to my technical work, I’m trying to improve public understanding and policymaker understanding of the greatest safety risks, because I think that will play a role in their decision. If the public becomes more concerned about safety, then there will be direct and indirect pressure on the companies to maybe allocate more of their resources to this question. If the public is concerned, then governments will be more likely to regulate or provide legal incentives, for example through liability. Maybe make them consider the safety investments that would be needed to scale the sort of thing I’m proposing, for example, as profitable even in the short term. I think in general for the safety issues, there are psychological barriers like cognitive biases that prevent people from being totally rational about what’s going on. That’s true in governments, that’s true in the general population, and that’s true within companies or even within academia. There’s all sorts of reasons that might explain why we’re not collectively taking the right decisions. So there’s the game theory aspect, but there’s also individual psychology. 
For example, we all want to feel good about our work, which means we’re going to maybe be biased towards thinking our work is going to be beneficial rather than harmful. And that’s going to be true of people in industry. That’s going to be true of people even in academia working on AI, because they want to feel like their work is going to bring a better world, not destroy it. And there are other factors, like some of the factors that we see with the attitudes regarding climate change: if the risk isn’t something that is in your face — you look outside and you don’t see catastrophic climate change, you don’t see robots killing people — you don’t think too much about it. You are much more concerned about your immediate worries. So I think that’s the real challenge. If we can improve the understanding at a gut level that people have of the magnitude of the risks that we’re taking collectively, things could change, and they could change pretty quickly. If you think about how quickly governments shifted their actions in a radical way after the beginning of the pandemic, you can see that they can move quickly when they take an issue seriously. And that usually is going to be driven by whether the people take the issue seriously. Rob Wiblin: Yeah. My impression is that the people at the companies are both pretty happy and impressed with their mundane alignment techniques, how well they’re going — but also appreciate that in a sense, they’re losing control, or they’re losing the safety guarantees that they used to have, because the models are going to be much more capable of potentially outsmarting them and are much more evaluation aware and so on. So in a way, they’re both satisfied with what they’ve done and also scared, I think, of what is to come. And that does create an opening for you. Yoshua Bengio: Yeah. I would bring here a very important aspect of the whole discussion about safety and catastrophic risks: there is uncertainty. 
In other words, we don’t know how things are going to unfold. We don’t know if the game that the companies are playing now in terms of safety is going to be sufficient. But if they fail and we continue with capability advances, then the consequences could be really terrible. So even if we don’t know the probability of some catastrophic event, we should apply the precautionary principle. What it’s saying is when you are in a situation where one action could lead into something terribly bad, but you’re not sure what is the probability for that — is it 1% or is it 90% or is it 0.1%? You don’t really know. And in our case, there is that kind of uncertainty, because you have respected people who are very concerned and other respected people who think it’s going to be fine. So if you’re in a driver’s seat and you’re faced with these different voices — even within the same person, they would one day say it’s going to be fine, and the other say this is maybe very dangerous — you should just bite the bullet: there is uncertainty about something potentially catastrophic, and then you should act with precaution. Which means you should invest a lot more in AI safety research in this case, you should invest a lot more in the incentives that would push companies to behave better with respect to the public good — just like we’ve done in other industries, by the way. But it’s important to really point out that we have to bite the bullet: that there is a lot of uncertainty. Rob Wiblin: And there’s going to continue to be. Yoshua Bengio: And it’s going to continue to be. Because it’s too easy, for example, for people who want to feel comfortable about the whole thing, to just listen to the voices that are reassuring. And in fact, we do it internally as well. So we just have to be honest that there is uncertainty, and the stakes are very high, so that should guide our decision making towards being on the precautious side. 
Rob Wiblin: So it seems like it would be good for the Scientist AI proposal, and I guess for our chances in general, if we could make things go a little bit slower — especially if we didn’t leap into fully automating AI R&D at the very first opportunity, which is kind of what it seems like we’re on track to do. What are your main requests for governments and for companies, in terms of buying us a bit of extra time to assess how these things are going and consider alternatives? Yoshua Bengio: For companies, I think they should invest a little bit more of their research into designing experiments illustrating not just the risks, but trying to undo some of the wrong beliefs that people have about AI. So let me be a bit more clear: a lot of people don’t actually believe that it’s possible to have machines that have goals that we didn’t choose. But that is the scientific reality now. There is no question. Rob Wiblin: I think you must have just not been paying attention to think that. But I guess many people aren’t. Yoshua Bengio: But the vast majority of people have a gut feeling that they can’t be conscious or some other excuse, or it won’t be possible to build machines like us. There are many things that people will say but actually don’t hold water. So I think there’s a real opportunity here to educate the public and the policymakers to realise that we are building agents that have their own goals — and right now, we can’t be sure that those goals are going to be aligned with what we want or go against our safety instructions. That’s a very simple message, but I don’t think — Rob Wiblin: Even that hasn’t broken through. Yoshua Bengio: The data — doing it well, doing it in a way that can’t be easily put into question — would help a lot in the public debate. And it has to be done in ways that the general public — who’s not an expert, who’s not going to read the system cards — is going to actually understand. 
Rob Wiblin: So there’s lots of examples of this kind of thing that would convince you and me. But I suppose people will dismiss them, saying that you can see maybe that it was a misunderstanding on the model’s part; it thought that we wanted X when we wanted Y. Or you can see how we did the training mistakenly so it induced this goal that we didn’t want it to have. Or I suppose they might just deny it outright in some cases. But are there any experiments you think that we could do that would be much harder for people to dismiss, even if they’re coming from a sceptical starting point? Yoshua Bengio: We need to set it up so that clearly the AI is not responding to a request, for example, to escape our control or do something bad that it’s not supposed to do. I think if the experiment is something that can be translated in simple words, simple analogies that people understand, it’ll be much more convincing. I don’t feel like I’m an expert on answering your question. Anthropic has been doing a lot of work along those lines, but I think all the leading companies should be investing in this, because it’s investing in changing the game. The problem is they’re in this competition game where they’re stuck, even with good intentions. And in order to change the game, they have to influence the understanding, which is biased and wrong right now, of the risks in the public. And policymakers are just like a representation of the public. Rob Wiblin: Yeah. I guess there’s all of these examples of the AIs doing crazy stuff, but often you can always say that it was just playing a role, for example. And I guess for you and me, we’re like, yeah, but it might end up “playing a role”: that’s how it could end up doing bad stuff. Or this is a demonstration of other failure modes that we’ll see later on — that we just, in general, don’t have a full grip on it. 
I suppose it’s so hard to get people to have to believe something that they really don’t want to believe, or that seems incredible to them. Yoshua Bengio: Yeah. So I think that’s like real research. It’s a real challenge. That isn’t where I’m putting my energy, because I want to get the Scientist AI out of the door as quickly as possible — but I think it should be a priority for people in AI safety, working in the companies or in academia, to think about how to do these experiments so that they will be convincing. And by the way, the more capable the AIs become — Rob Wiblin: Maybe the easier this task is going to become. Yoshua Bengio: Yes, yes. Rob Wiblin: Apart from Scientist AI and this, are there any other top requests that you have of people in the companies, or is there any common practice that you think is particularly crazy that they should maybe cut out? Yoshua Bengio: Yes: Please don’t use an untrusted AI system to design the next generation of AI systems. This is the most crazy, dangerous bet that unfortunately we are on track to do. And keep in mind that, as is now scientifically clear, these systems are likely to know that they are being tested. So you might think that AI is honest, you might think that the AI is not deceptive, you might think that AI is aligned — but maybe it’s just pretending, and it’s going to be very difficult to know. And we should do our best to try to figure it out, but we should put the bar really, really high before we allow an AI to design the next version of AI, in terms of are we sure it’s not being deceptive? Rob Wiblin: Yeah, I think we’re currently on track to start on fully automated AI R&D and have the companies be saying, “We got the AI to monitor itself, and it didn’t flag anything. And that’s why we feel pretty good about this.” I actually think that is like the most likely outcome. I guess we’ll see how that goes. Fingers crossed we can do better. 
But earlier on you were talking about, as you’ve become more optimistic, in a sense, that we do, at least in principle, have a solution to the control problem, you’ve become more worried about the human concentration of power stuff. Do you have any suggestions, any policy ideas here? Actually, is there anything technical we can do here, or is this primarily a policy and politics question? Yoshua Bengio: Well, there’s a connection between the technical safety work and the policy safety work, in the sense that if we can demonstrate the existence of AI systems that would be competitive, capable, and safe, it’s going to be easier for government to impose the requirement that you have to show that your AI system is going to be safe in a way that independent scientists will say yes. Right now a lot of the governments are focusing on economic competition driven by AI, and that makes them also blind to the risks. So that’s where technical safety can help: it’s going to be easier to say that we can have both safety and competitivity. On the pure policy side, I think the biggest challenge right now is how do we get countries to agree with each other, in spite of the competition, including very strong distrust and disagreements on the political foundation. And that’s a place where we also need actually more technical research on verification methodologies that could be at the basis of treaties between, say, the US and China, which don’t trust each other. There is not enough research going on there, but a lot of people are starting to think about this, and think it’s quite feasible to change some of the programming or even the hardware to make these kinds of verification reliable, and we should do more. Governments should realise that if they want to end up with a treaty that they would sign, they need to incentivise that kind of research as well. Also, governments need to understand how transformative AI will be. 
I think a lot of the wrong thinking in many governments — I’ve been around the world talking to many different governments, like at least a dozen in the last year — the biggest mistake is to view AI in the future as if it was just a slightly beefed-up version of the AI we have now; and then focusing on AI as a normal technology that they would compete with other countries; and focusing on deployment because you get more productivity, for example, and not so much on the risks. In great part this is again because people in government, just like most people, don’t really digest the idea that we are on the verge of creating entities that can compete with humans and that could become tools of absolute power in the wrong hands. I’m not saying it will happen, but even if it’s only a 10% chance that capabilities rise to that level in the coming years or whatever, this should completely alert politicians that they have to do something about it. But the fact that they’re not doing it tells me that they haven’t yet integrated that scientific reality that we are on track. We see already on a small scale the progress towards these kinds of machines, so they need to wake up from their old mental constructs of seeing technology as mostly from an economic perspective, or even giving them a military advantage, and not realising we’re opening a Pandora’s box with incredible unknown unknowns of magnitude of impact, both positive and negative, that is very hard to anticipate. So that’s where I would ask governments to start reading more, listening more, and just spending a bit more attention on understanding what is going on with AI, where it is going, and what this could potentially mean. Rob Wiblin: You spent a lot of time talking to governments over the last couple of years, people in governments, but it seems like, by and large, they are not troubled primarily about the stuff that you and I are concerned about, but certainly not about loss of control as a key focus. 
Have you gotten any leads on what are the best things to raise, the best experiments to talk about that actually get people to think of that as a top-tier concern rather than a secondary or tertiary concern? Yoshua Bengio: I wish I had the answer to this, but I can say a few things. One factor when thinking about which arguments work is how much time you’re able to spend with the other person to explain those things. If you’re going to just talk to the public at large through a few messages, you won’t be able to change their mind very much on the foundations of their beliefs about humans and machines, for example. The only way you can catch their attention is to talk about things that they are already preoccupied about, close to their immediate concerns — like jobs, like the effect of deploying AI on children, and things like this. We can see that this is something that has emotional valence for many people, so we do need to talk about those things. But of course we may end up with regulation or government intervention that deals with this, but doesn’t deal with more serious problems we’ve discussed. And for this, unfortunately, it takes more work. It’s not enough to just write a paper in a newspaper or something like this, or even be interviewed at the evening news — because I’ve done these things. Where it’s working is when you can spend enough time almost one-on-one with a person, like hours, so there can be a dialogue where you can show them that their preconceived ideas actually don’t hold water — that there is data, that there is evidence that these can be really dangerous. But it’s not something that happens quickly and easily, unfortunately. I mean, there are exceptions. There’s somehow a minority of people who get it quickly, but the vast majority doesn’t. 
Rob Wiblin: Yeah, for what it’s worth, I think there is an experiment that was done a couple of years ago, where they presented a random sample, I think of Americans, with many different essays basically explaining the control problem with many different angles and focuses. And they all worked reasonably well if the person read this substantial block of text. And they all worked about equally well, which is interesting, the many different angles. It was kind of just an exposure effect of actually sitting down and thinking about it for some period of time. But I guess it’s hard to get people to spend a lot of time thinking about this, especially if you’re asking for the whole population to do it. Yoshua Bengio: There is a sense in which things could get better quickly. If we are able to catch a little bit of the attention of people, then they will read more or listen more to the discussions around AI and the risks, and then it could feed itself. If you’re concerned about something, you’re going to read more about it, and now you are entering into a phase where you can digest more of the things that go against your prior beliefs about humans and machines, for example. Rob Wiblin: I guess events may draw a lot more attention to this problem, for better or worse, but I suppose the window between people paying a lot of attention and when big decisions have to be made might be quite narrow, unfortunately. Yoshua Bengio: I often get the question, “Are you optimistic or pessimistic?” — both about the choices I’ve made in how I spend my time, but more generally about our future and the risks with AI. And my answer is always that it doesn’t matter if I’m optimistic or pessimistic — actually, I’m a naturally optimistic person — but what matters is whatever each of us can do to shift the needle even a little bit. And for most of us, it’s going to be a little bit. Each of us has some skills or something to bring to the table. 
I’m a machine learning researcher, so I’m focusing a lot of my energy on this, on how those skills can be put to use here. But every individual citizen, especially in a democracy, can influence the government. They can talk to each other more about it: that’s how you start thinking through and questioning your own beliefs. You can influence your representatives and so on. This has worked for many other social issues and political issues in the past, and it can again. So yeah, we should go back to feeling good about our actions by choosing our actions towards shifting the needle, even if there’s no guarantee that it’s going to work. Rob Wiblin: Yeah, I’ve been worried about this issue for 15 years or so, and I’ve been working on it more intensely only the last couple of years. But I often find myself just feeling quite drained and exasperated and a bit exhausted — I think the main reason being just so often encountering I feel like people who are creating the problem who feel like they want to be wilfully blind to the issue. I mean, I guess being more charitable: it’s hard to understand; we’re all speculating about how things might go. But in my heart I often just feel like people are deluding themselves almost quite consciously, and just saying absolutely crazy stuff about how they think it’s going to be safe and things are going to go fine. And that’s just emotionally, frankly, quite draining. It’s almost difficult to maintain motivation when you’re fighting against people who are actively creating a problem where they could stop or take actions and have a lot more effect than you, I suppose, if they were willing to be more honest with themselves or be more thoughtful, and pause and really really reflect on what’s going to happen. Did you also have this experience? And how do you maintain your motivation in the face of what I guess I find very frustrating? 
Yoshua Bengio: Just going back to my previous answer, I was extremely concerned initially, and anxious and I was worried about the future of my children and my grandchild, who was one in 2023 when I started really focusing on this. But what saved me from all that anxiety is deciding I would do something about it. And by the way, you’re doing something about it, so you should feel good about it. Rob Wiblin: I feel good. But also very frustrated. Yoshua Bengio: Yes, yes. But you can turn frustration into questions, like the Scientist AI: why is it that people don’t get that these are crazy serious risks? And it is an activity trying to figure it out, which lifts somehow, at least for me, a lot of the heavy burden of thinking about what can go wrong. Turning from fear to action to avoid the problem, even if there is no guarantee, is extremely powerful. Rob Wiblin: I think the situation in which it’s most frustrating is when it feels like people are kidding themselves out of financial self-interest when they’re doing it, because they have equity in some company that wants to go very quickly. I have felt somewhat better noticing that many people who don’t have a particular financial stake in here — and indeed, would be better off by their own lights, in my view, if they were advocating for going slower — also don’t think that there’s a serious problem here. It doesn’t seem like the financial thing is the key predictive variable. It’s something else I think about how people reason about as-yet-unknown technologies. Yoshua Bengio: Yes. And I think there is another reason, which is sort of very basic psychology that has to do with just an unconscious movement towards thoughts that make us feel good. This is actually something that psychologists have been studying quite well. Rob Wiblin: That’s not universal, right? I find myself often drawn to quite negative thoughts sometimes. Yoshua Bengio: You can. But there’s this force, right? And it’s acting on a lot of people. 
For the most part, I think the people working in the companies that you’re mentioning, it’s not that they consciously make those choices that you think are wrong. It’s more like the brain works like this: that they will be biased towards feeling optimistic about how things will turn out, because that’s what makes them feel good about themselves, about their work. Now, I’m not saying this always happens. So why did I change my mind, for example? It’s an interesting question. Rob Wiblin: So back in 2019, I think you said to The New York Times that you thought worries about loss of control were completely delusional and fantastical. Yoshua Bengio: I didn’t say those words. Rob Wiblin: OK, no, what was it? They were “ridiculous.” I think that was the quote. Maybe that was just the Terminator scenario in particular. Yoshua Bengio: I think so, yeah. I rarely use words like this, but I know what I was thinking and the kinds of things I’d been saying. So at that time, I thought, first of all, the Terminator scenario is ridiculous. Time travel and stuff. Rob Wiblin: OK, yeah, the time travel. Yoshua Bengio: But also, it was clearly not reflective of the kind of actual risk. We don’t have robots, and even less in 2019. But more importantly, I think the main reason I was saying those things is I was hiding behind the belief that it would be so far into the future that we could reap the benefits of AI well before we got to that point. And why did I not pay attention, or not that much attention, to, say, the loss of control risk? I’d been exposed to it for more than a decade. I’d read some of the AI safety literature. In 2019, I read Stuart Russell’s book. I had David Krueger as a student. Rob Wiblin: He’s very, very doomy. Yoshua Bengio: He exposed me to these thoughts. But remember, I was actively working on making AI smarter. And you want to feel good about your work. That’s it. It’s not money. Rob Wiblin: Do you really think that was the reason for you? Yoshua Bengio: Yes. 
And now it’s interesting to ask me, why did I change my mind? So one way I like to think about this is something that the Buddhists say: to fight an emotion that somehow makes you do the wrong thing, just reason alone is weak for most people. You need another emotion that counters the emotion that pushes you in the wrong direction. And for me, the other emotion that’s very powerful is love, love of my children. I couldn’t live with myself with the idea that I would just go on after ChatGPT came out and not do something about it, because I felt like I couldn’t hide from myself the possibility that we were on track for something terrible. I knew that neural nets were, by construction, very difficult to control, and especially with reinforcement learning. So I don’t know why it works for some people and not for others. But really for me, it was an emotion that helped me counter the kind of unconscious drive to look the other way. Rob Wiblin: It’s very tempting to try to explain people’s disagreeing views by saying it’s like arational factors — like they want to feel good about themselves or their work. But I feel that there’s a mirror discourse on the other side, where they’ll say people like you and me have been deluded by science fiction, or we want to believe that our safety work is important. And I find that not credible and very frustrating and not persuasive when people try to attribute my beliefs to irrational [factors]. Of course, to some extent we’re all irrational, but when people are like, “You just read too much science fiction and you’re delusional,” I’m like, “No, I’m not. That’s not it.” So maybe even if I do have these beliefs about other people, I don’t expect it to persuade them very often. And I almost feel like you need to go out of your way to try to engage with the substance of what they’re saying, even if you think that maybe that’s not doing the heavy lifting. Do you have any thoughts on that? Yoshua Bengio: Yeah, totally. 
It’s a lot of work, but we need to take one by one each of the arguments that people bring up against acting with precaution. And it’s not very effective, but it is a necessary part of being honest about what we’re doing and honest with ourselves. So for a while, I was concerned, but I was hoping that somebody would have an answer for me that would reassure me. Rob Wiblin: And then you looked. Yoshua Bengio: Then I looked. Then I talked to people who thought it would be fine. And out of that came a lot of conversations that helped me build up the understanding of the arguments. And unfortunately, it didn’t convince me that we were fine, so I continued trying to work, but now more on how do we fix the problem? So yeah, I agree with you. And I think we also have to have the humility that maybe you and I are wrong. Like, maybe it’s all going to be fine. Rob Wiblin: There’s a substantial chance that things work out OK. Yoshua Bengio: Yeah, and I’m totally at ease with that possibility. In fact, I hope that we are wrong. But I think the honest posture should be: if we don’t know who’s right among the people who think it’s going to be fine and the people who think it’s going to be catastrophic, if people will just say, “OK, so there is that uncertainty. What do we do about it?” then the rational thing becomes clear: we need to do at least enough to mitigate the greatest risks. Rob Wiblin: Yeah. My best guess is that the cat-and-mouse game that Anthropic is playing is more likely than not to be sufficient to prevent catastrophic misalignment and loss of control. But better odds than 50% is not sufficient in my mind. I’m like, why can’t we get to 90% or 99%? And there it feels like we just are nowhere near having the really strong evidence or guarantees that we would need to feel that good. Yoshua Bengio: Exactly. I think there’s a big difference between 50%, or even 1%, that bad things will happen. 
And what I’m proposing with the Scientist AI, which is 99.999% basically, is this kind of scale of safety is where we need to be when we approach superintelligence. Rob Wiblin: I think I’m probably willing to run a little bit more risk than that, because there are other risks that AI would help us to reduce, right? So maybe like 99% would be… Yoshua Bengio: No, no. I’m only talking about deceptive behaviour. Rob Wiblin: I see. Yoshua Bengio: So it doesn’t solve the power concentration problem, which is why I’m also spending time on that. And by the way, collectively, I don’t think we spend enough time on that, and we don’t discuss it, but it has become much more important in my mind — because I do think now that there is a way technically to solve the problem of loss of control. The next biggest risk is AI dictatorship. Rob Wiblin: Yeah. And we’re a long way from fixing that. We’ve had a lot of coverage of that on the show over the last year. I guess it’s become a more salient issue. Is there anything that you would want to direct people to, who want to focus on that in particular? Yoshua Bengio: I think we should encourage the international discussions. Even though it’s true that the most important decisions are going to be in the US and China, and there are a lot of people in other countries who feel powerless and governments in those countries who feel powerless. But it’s a mistake. People outside the United States and China can do something about it. And the starting point is to understand the kind of discussion we’re having: that yeah, we don’t know what’s going to happen, we don’t know if we do something it’s going to help or not — but I think there’s a real chance it could, and we have to take those chances. Rob Wiblin: You made a massive shift in what you were working on in 2022 and 2023, going from focusing on capabilities to focusing on reliability and safety and so on. 
Do you think other people in AI who are more senior perhaps underestimate their ability to make a big career change and to switch their focus? I guess [Geoffrey] Hinton did roughly the same thing, right? Yoshua Bengio: Yeah. I guess it’s easier for people who are already established. I see a lot of my students who seem to understand what I’m talking about and kind of generally agree that this is dangerous, but in their mental decision-making calculation there is like, “What about my career, my family? I need to have a good salary.” Rob Wiblin: I feel like there’s reasonably good money to be made in alignment and safety and reliability work as well. Yoshua Bengio: But not as much. Rob Wiblin: Not as much, no. It is less, but good by any normal standard. Yoshua Bengio: Yeah, I completely agree with you. But there is a professional anxiety in machine learning students, which is kind of surprising. If I go back 10 years ago, or even 15 years ago, even before deep learning was something people talked about, the salaries for people coming out in my group with a PhD in machine learning were nothing compared to what we have now. But people were not as anxious about that somehow. I don’t know, maybe it’s a status thing. Because there are these crazy salaries, people feel drawn to this as they have to achieve that status even though they don’t actually need to earn millions of dollars per year. It is much more important, in my opinion, to think of what kind of world they will live in or their children will live in. But that is what’s happening. Again, it’s not rational. It’s human psychology at play. Rob Wiblin: Back in 2023, I think you gave a p(doom) of 20% in an interview. I haven’t seen a p(doom) that you’ve given anyone since then. Would you venture to say whether that’s gone up or down? Or are you staying out of the p(doom) game? Yoshua Bengio: I’d rather stay out of the p(doom) game. But let me explain why. 
It’s connected to my discourse about uncertainty that I keep saying: I myself don’t feel 100% sure that what I see as plausible is going to happen, but I do recognise that there’s a lot of uncertainty. So putting a number like this is making a big commitment about what’s going to actually happen, where we don’t have scientific data about how to calculate such a number. So I’m much more comfortable with saying, well, it could be small, it could be large, but that’s a large interval in which the probability is way too high for my taste and for the future of my children. So whatever it is, so long as it’s not 10^-20, I’m not happy, and I’ll do something about it. Rob Wiblin: Final question: Back in 2019, you’d heard the arguments, but you weren’t bought in. What would you say to someone who is still today where you were in 2019, who has managed to get through the rest of this interview? What would you want to communicate to them? Yoshua Bengio: It’s a good question. I would say something that’s difficult for people to do: try to leave aside your prior beliefs about intelligence and the efficiency of markets, or whatever your beliefs are, and just try to focus on the evidence — the evidence that has been collected empirically by the companies and academics and nonprofits in the last couple of years especially, but also theoretical evidence that has been developed over more than a decade in AI safety about the fundamental reasons why, for example, if you do reinforcement learning, you’re going to get reward hacking. I think a lot of people like machine learning researchers simply haven’t even taken the time to read those papers, so it’s easy to dismiss as “these people must be biased by science fiction” or whatever it is. When you actually look at the theory and the experiments that are in front of us, it’s much harder for a scientist to deny that reality. 
So I would encourage a kind of openness of mind to take the time to read through the evidence before committing to a view, and that will be the scientific thing to do. Unfortunately, there’s a bad polarising effect here. Once a person commits to a view of “it’s going to be fine,” for psychological reasons it’s very difficult to back away from that — because you want to feel good about the things you said in the past. So it’s difficult to say, “I changed my mind, I made a mistake,” but this is the right thing to do from an epistemic, scientific point of view. If scientists didn’t accept that they could have made mistakes in their theories, interpretations, and so on, then we wouldn’t have progress. We wouldn’t have scientific progress. It is when people are willing to question their own beliefs and look at the evidence that we can make progress. Rob Wiblin: My guest today has been Yoshua Bengio. Thanks so much for coming on The 80,000 Hours Podcast, Yoshua. Yoshua Bengio: Thanks for having me. Rob Wiblin: And thanks for all you’re doing. Yoshua Bengio: And you, too.