
Transcript

Godfather of AI: How To Make Safe Superintelligent AI – Yoshua Bengio


TITLE: Godfather of AI: How To Make Safe Superintelligent AI – Yoshua Bengio
CHANNEL: 80,000 Hours
DATE: 2026-05-07
---TRANSCRIPT---

Rob Wiblin: Today I’m speaking with Yoshua Bengio. He is the scientific director at LawZero, a Turing Award winner in 2018, the most cited computer scientist of all time, and, as it happens, also the most cited scientist of any type that is still alive. Thanks so much for coming on the show, Yoshua.

Yoshua Bengio: Thanks for having me.

Rob Wiblin: You think you’ve found the right approach to build a safe superintelligent AI. What’s the approach?

Yoshua Bengio: It’s based on a simple notion that if we can bake honesty into AI, we can get safety. So then we can reduce the problem to how we train a system to be honest, and it turns out that there’s a way to do that that only requires changing the training objective and the way the data is processed.

There’s also another aspect: it’s a system relying on a non-agentic foundation that is a predictor, that is not trained by reinforcement learning, and is going to have these honesty guarantees. But we can then use this, using the same kind of math, to construct a policy, construct an agent that will be trained in a way that also provides those guarantees.

Rob Wiblin: So what does the new training process look like, and how is it different from the models that people are familiar with?

Yoshua Bengio: The main difference with the training process is that it is geared at approximating the Bayesian posterior over queries in natural language.

So imagine a neural net with some extra apparatus around it, like chain-of-thought style, that takes questions about statements regarding properties of the world (statements that can be true or false, given other statements) and then it outputs probability. That’s the core building block.
We   call it a “predictor.” And we can use stochastic  gradient descent on a different objective that has   the property that the objective is globally  minimised by the Bayesian predictor — in   other words, the predictor that fits the  data and has a small description length. Rob Wiblin: So you’d be building a model  where you would feed in a statement and it   would basically tell you what probability  it assigns to that statement being true? Yoshua Bengio: Yes. In context, yes. Rob Wiblin: Hey, listeners. Rob jumping in here.  Yoshua is naturally pitching this in a way that’s   ideal for staff at frontier AI companies, and  they’re obviously a particularly important   audience for this proposal. But I’m confident  that with just a few minutes of plain language   explanation, everyone else will be able to follow  the rest of the conversation as well. So bear with   me, or skip ahead about four minutes if you feel  very at home with this sort of material already. As you probably know, in their first stage of  training, today’s large language models are   taught to predict the word that’s most likely  to come next, or at least the token that’s   most likely to come next. And then, in a second  stage, reinforcement learning trains those models   to produce the kinds of responses that we’re most  likely to say that we like, that we want — rather   than just the responses that were most probable  in the full corpus of all human-generated text. Now, Yoshua’s alternative is to build an AI  model oriented not around predicting what a   human would be likely to say or what they  would prefer to hear, but around modelling   what’s actually true in the world by developing  hypotheses and assigning probabilities to them   with the goal of best explaining all of the data  that it’s exposed to during its training process. 
Yoshua argues that you’d be able to train a  model of this type while porting over most of   the methods we used to train ordinary LLMs today,  benefiting from the same neural net architectures,   training techniques, scaling improvements, all  of that. And you’d also be able to train it   on roughly the same body of raw texts that we  use for all other AIs, but we could structure   that data a bit differently, giving it what  AI researchers call a different “syntax.” First, all of the things that people said or  wrote, they get tagged as “communication acts.”   We know someone said these things and we know  where they said it, but we don’t know whether   they’re true. And second, a small number of  statements that we have strong independent grounds   for — verified mathematical proofs and some  scientific measurements — get tagged as verified   factual claims about the world. The model is  then trained to find the combination of possible   underlying facts about the world that would best  explain everything that it sees in aggregate:   both the things people said and the verified  facts that it’s been given as ground truth. These hypothesised facts about the world,  they’re what AI researchers call “latent   variables,” meaning variables that the AI can’t  directly observe, that it’s going to have to infer   indirectly instead. What the model will ultimately  be able to give us is its estimated probability   that any given statement in natural human language  is true, as well as how much the model trusts its   own answer on that, or how confident it is  that it has a good grip on that question. Crucially, Yoshua says that by tagging  all text into these two categories from   the very beginning — things someone  said versus factual statements — you   can then ask the model questions as  though you’re asking about reality,   not about communication acts, by using the  factual statement tag. 
And because these   two categories have been there from the very  beginning, the model knows the difference and   it won’t blur the line between the two. That’s  something you don’t get with AI models today. And Yoshua also argues, using various  mathematical theorems in his papers,   that unlike ordinary LLMs, a model trained in this  way would be honest by design — and furthermore,   that such an AI model would by itself  have no goals and no preferences about   the state of the world; it would be what  Yoshua calls just a “pure predictor.” Now, there’s two main uses for this:   near term, as a sort of stopgap solution, you  bolt the predictor onto existing AI agents as   a sort of guardrail — an independent filter  that sits between the agent and the world,   checking over its proposed actions and rejecting  those that it predicts will be harmful. But as he’ll explain in a minute, Yoshua  thinks we can ultimately do much better than   this. Yoshua wants to put scaffolding around the  prediction model, asking it different questions   at each stage to effectively assemble it into a  capable agent while keeping it just as honest as   it was before. We’d then hopefully be  able to have our cake and eat it too,   getting the highly capable agents that businesses  are craving and demanding and insisting on,   while still being confident that those  agents are being completely direct   with us. Yoshua thinks that these agents  perhaps might even be more capable as well,   thanks to a superior reasoning process — or at the  very least, a clearer and more explainable one. It’s fair to say that this proposal is huge  if true, or at least huge if it will work.   And of course, not everyone is sold on that  idea, as Yoshua and I will discuss later. OK, that’s the shape of things to come. 
The  technical discussion continues for a while,   but if you decide you want to skip that, the  second half of the conversation stands very   well on its own, starting with the  chapter “How much would this cost?” All right, on with the show. Rob Wiblin: And how would  you train a model like that? Yoshua Bengio: You do it by showing it, for  example, the same kind of data that is used   currently to train advanced models, except  that that data has been modified. So whereas   currently our autoregressive models, for  example, are trained to predict the next token,   this thing is trained to predict whether the  next statement is true or false. Typically,   the next statement is going to be  what we call a “communication act”:   it’s going to be something that is taken from a  document somewhere, and we’re not sure that the   claim made in that statement is true or false.  But we’re sure that somebody made that claim,   and we may have information about it  — who, when, and where, and so on. So the AI is going to be trained to explain  those statements. So not just compute those   probabilities, but in what we call its “latent  variables” — which are also natural language   statements — come up with the best explanation  it can find, including causal explanations. So   what you get at the end of the day are  these probabilities, but you also get to   represent hypotheses about the world that are not  communication acts; that are factual hypotheses   that the system isn’t necessarily sure about,  but it’s going to be producing a probability   for these. And then we can query these same kinds  of factual statements — whereas in normal LLMs,   the only kind of query you can make is about  whether a person would respond in a particular   way. Maybe you can use a pre-prompt to ask for  a different kind of persona, but at the end of   the day you get what a person would say — which of  course can be deceptive for all kinds of reasons. 
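[Editor's sketch, not part of the conversation: the distinction Yoshua draws between "communication acts" and factual statements can be pictured as two syntactic forms applied to the same raw text. The tag names `<said>` and `<fact>`, the class names, and the metadata fields below are all hypothetical, chosen only for illustration; the transcript does not specify the actual syntax.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommunicationAct:
    """Someone said `text`. We know it was said, not whether it is true."""
    speaker: str
    venue: str
    text: str


@dataclass(frozen=True)
class FactualClaim:
    """`text` treated as a property of the world (verified, or a hypothesis to query)."""
    text: str


def render(statement) -> str:
    # Hypothetical surface syntax: the only requirement taken from the
    # transcript is that the two forms are trivially easy to tell apart.
    if isinstance(statement, CommunicationAct):
        return (
            f"<said speaker={statement.speaker!r} venue={statement.venue!r}> "
            f"{statement.text} </said>"
        )
    if isinstance(statement, FactualClaim):
        return f"<fact> {statement.text} </fact>"
    raise TypeError(f"unsupported statement type: {type(statement).__name__}")


# Most training text would be wrapped as a communication act...
training_example = render(
    CommunicationAct("forum user", "online forum", "The Earth is flat.")
)
# ...while a query about reality uses the factual form for the same words.
query = render(FactualClaim("The Earth is flat."))
```

[The same sentence appears in both forms, so a model trained on this kind of data could be asked about the claim itself rather than about whether someone would say it.]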
Rob Wiblin: So what are all of the ways that you think that the models that we’re currently racing to build now are unsafe? And you call this “Scientist AI”: why would that kind of model be different and better?

Yoshua Bengio: Right now we have systems that have implicit goals. So what do I mean by this? I mean that they will of course be trained to please us, for example, or to respond like a person would. But both of these parts of the training (the autoregressive pretraining, where they’re trained to imitate people; and the reinforcement learning part, where they’re trained to please people or respond in ways that get positive feedback in things like RLHF [reinforcement learning from human feedback]) induce implicit goals.

So what do I mean? Well, for example, in the pretraining, that means the AI is going to inherit our self-preservation drives. And more recently, we’ve seen they also inherit our drive to protect others like us, which means AIs have been shown to behave against our instructions to protect other AIs that would be shut down. It’s called “peer-preservation” now. So that’s an example.

And then the goal-seeking part of the training with reinforcement learning induces an issue with instrumental goals, and potentially also reward hacking, which basically means that AI will have a drive to do things that we didn’t ask and maybe we would disagree with. And this is not just theoretical. I mean, there is theoretical analysis which shows why it will happen, but it is also observed in experiments. Now, maybe this could be fixed by patching such systems, and this is what companies are trying to do, but it’s a game of cat and mouse, and right now the mouse is growing and the cat doesn’t seem able to catch the mouse. And I’m worried that monitoring or more alignment training isn’t going to solve the problem.
At   least I don’t see any kind of strong assurance  or even less mathematical guarantee that it will. It’s worse than that. We’ve seen that those  systems, now the most advanced systems,   know that they’re being tested and  they will behave differently so that   they pass the tests — because of the  self-preservation drives, presumably.   Which means we may put in all these patches and  think everything is fine and not really know. When we will probably use these systems  to design the next versions of AI,   so AI used to do AI research, this becomes a  real problem if those AIs can plant backdoors   into the code they generate that will help  future versions of themselves escape our   control. Then we are really in a bad place.  It would be much more reassuring if the system   were designed to be honest in the first place  and wouldn’t have these deceptive behaviours. Rob Wiblin: I’m a little bit surprised that  you’re foregrounding the potential for it to   come up with kind of implicit goals during the  pretraining — the “predict the next word” stage,   where it learns to mimic humans.  Because we’re investing an enormous   amount of effort in making them extremely  proactive agents with very explicit goals:   that seems to me like where I’d be  most worried about things going awry. Yoshua Bengio: I’m worried about both. The  behaviour of peer-preservation that I just   mentioned is difficult to explain  on the grounds of reward hacking or   instrumental goals. How does it help the  AI to protect other AIs? It’s not clear,   but it’s very clear that that would be a human  thing to do, to protect others like you. So   that makes me think that the pretraining  is still a big part of these hidden goals. And I want to add something to what I  said earlier: I don’t think anybody,   including me, has any guarantees that  the current approaches will fail,   that the patches that companies  are working on will fail. But that’s not the bar that is sufficient  for me. 
I want my children to live in a world   where they will have a future and there  will be a democracy for them to live in. But   even a 1% chance of something going really,  really bad is not acceptable to me. So I think   it is really important that we explore all the  possible promising ways to solve the technical   issues. And of course, there are political issues  as well. But on the technical side, we should   really be taking this seriously. And the stakes  are so high, we should try multiple approaches. And now, with the work that I’ve been doing,  I’m really convinced that there is a path.   And it is not something that’s going to take  a decade; it is something that is very close   to the current design, and can reuse the toolbox  that currently is behind the most advanced AIs. Rob Wiblin: What sort of training  dataset would you need to make,   and then how would you turn that into a model? Yoshua Bengio: The raw data would be the same  as what is currently used; it’s just that the   way the data is presented to the network that  would be different. The main characteristic of   how the data is transformed is that there will  be a syntactic difference — in other words,   very easy to see by the neural net —  between most of the input statements,   which will be tagged as “communication acts.” In other words, “Somebody said X, and X is  what we found in some texts.” And you could   have other metadata. That’s one syntactic  form. And then the other syntactic form,   which will be used on a much  smaller category of statements,   is what you could call a “factual” or “hypothesis”  syntax, where we’re saying that this is an   actual property of the world. In the case of  latent variables, it would be a hypothesised   actual property of the world — not just what a  person would say, but that this is true. Now,   sometimes you don’t know that it’s true, but  you can consider it as a latent variable. Rob Wiblin: What’s a latent variable? Yoshua Bengio: Oh, sorry. 
This is probabilistic  machine learning jargon. In probabilistic models,   you try to capture the probabilistic relationship  between many random variables. Here the random   variables are Boolean — something is true  or something is false — and the “something”   could be any property of the world that  can be expressed in natural language. Now,   in the data, what we have — once we’ve set up this  pre-processing that I mentioned — is a bunch of   statements that we know are true. We know that  somebody wrote those things, and maybe we know   more — like where and what venue and so on. And  we know, for example, that such-and-such theorems   are true or that such program produced such  output and such scientific data was observed. So there’s a bunch of random variables for which  we know the answer: it’s true or it’s false. And   for everything else, we don’t know — so we call  them “latent” because they’re not observed. Or   sometimes people use the term “hidden variables.”  And what happens here is, because the system is   trying to learn the joint distribution — so  how every variable is related to every other;   not just pairwise, but any subset — the system is  trying to calculate the probability that they are   all true, or one is true given others, we’re  learning that joint distribution. Including   the latent variables — the ones that we don’t  observe — because of course, these are the ones   we care about: we want to ask questions about  the things we don’t know already the answer. Rob Wiblin: Maybe you can explain if I’ve got  the right picture of how this would work. You   put a huge dataset of all of the things that  people have said, and where they said it,   and who was speaking, and when. And  then I guess, in the same database,   you’ve also got a set of things established  as true — like statements that you’re just   going to say that this is the ground truth  that we’re going to try to predict. 
Then you   try to use the speech acts, the things that  were said, to predict the things that you are   claiming are true. So it builds a world model  internally, where you can feed in statements   and it will give you a probability that that  thing is true in the world model that it has. Yoshua Bengio: That’s right. Now, there is an  important element here, which is that most of   the topics that we would like the AI to make  predictions over we don’t have ground truth   about. For example, what people actually want, or  things that have to do with humans or psychology   or history and society. Usually the only thing  we have are communication acts. Some people   said this thing, some people said something  else, and often they contradict each other. So there are two things here to help us deal with  this kind of mismatch. One is that the training   objective for the Scientist AI is basically  about coming up with explanations — so assigning   probabilities to statements that are latent, that  we don’t observe, that are good at explaining the   data we do observe. So if we observe somebody  saying the Earth is flat, first it’s going   to understand it doesn’t mean that the Earth  is flat. It means that this person believes,   or says actually, that the Earth is flat. And even  if a lot of people were to say the Earth is flat,   it doesn’t make the model believe that the  Earth is flat — because there may be a better   explanation that is consistent with other sources  of data, like everything we know about the planet.   So a better explanation here is that these people  form a group and they have these false beliefs,   like many humans have, for all kinds  of psychological and cultural reasons.   So that’s what the Scientist AI  would do. It would be trained,   its objective would be optimised when  it finds good predictive explanations. 
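[Editor's sketch, not part of the conversation: the flat-Earth example can be worked through as a tiny Bayesian calculation. There are two latent variables, "the Earth is flat" and "these speakers belong to a group holding a false belief", and the posterior is computed by brute-force enumeration. Every number below (priors and likelihoods) is an invented assumption for illustration, and a real system would approximate this with a neural net rather than enumerate.]

```python
from itertools import product


def posterior(num_claims: int, measurement_observed: bool = True):
    """Return (P(Earth is flat | data), P(speakers share a false belief | data))."""
    prior_flat = 0.5    # assumed prior that the Earth is flat
    prior_group = 0.1   # assumed prior that the speakers form a false-belief group

    weights = {}
    for flat, group in product([True, False], repeat=2):
        prior = (prior_flat if flat else 1 - prior_flat) * (
            prior_group if group else 1 - prior_group
        )
        # People say "the Earth is flat" if it is flat OR if they hold the belief anyway.
        p_say = 0.9 if (flat or group) else 0.01
        likelihood = p_say ** num_claims
        if measurement_observed:
            # A verified measurement that is very unlikely on a flat Earth.
            likelihood *= 1e-6 if flat else 0.99
        weights[(flat, group)] = prior * likelihood

    total = sum(weights.values())
    p_flat = sum(w for (flat, _), w in weights.items() if flat) / total
    p_group = sum(w for (_, group), w in weights.items() if group) / total
    return p_flat, p_group


# 100 people say the Earth is flat, and we also hold one verified measurement.
p_flat, p_group = posterior(num_claims=100)
# Same 100 claims, but with no verified data to anchor the explanation.
p_flat_speech_only, _ = posterior(num_claims=100, measurement_observed=False)
```

[With the verified measurement included, the explanation "these speakers share a false belief" absorbs almost all of the probability and the belief that the Earth is flat stays tiny, however many people repeat the claim. In this toy setup, dropping the measurement lets the repeated claims dominate, which is why the small set of verified statements matters.]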
Now, another trick that is going to help us in  this process is that when we train the Scientist   AI and it’s trying to predict a communication  act, like somebody said the Earth is flat,   we automatically are going to make sure that  among the latent variables that are going to   be used to explain that will be whether the Earth  is flat or not. So even in domains where we don’t   have observed truth about a property of the world,  because we basically only have communication acts,   we will force the neural net to commit  not to the truth of the underlying claim,   but to the probability of that underlying  claim, as well as trying to find other   latent variables that are good explanations  to that — just like a good scientist would. So a scientist or a psychologist trying  to understand why a person said something   isn’t necessarily just going to believe  what they say, right? They’re going to   try to understand what are the psychological  factors here or the particular culture of that   person that make them say those things. So the  Scientist AI would do exactly the same thing. Rob Wiblin: So when I heard about  this idea nine or 12 months ago,   I think the gloss that I got was that the core  thing is that the Scientist AI is not an agent,   that it is indifferent about states of the  world. Like a weather forecasting model   doesn’t care what the weather is: it just  tries to predict what the weather is going   to be. And this kind of model would spit out  probabilities of things being true or false,   but it wouldn’t care what state the world is in,  and it wouldn’t be able to take actions by design. Is that kind of a core part in  your mind? As I understand it,   you think actually this is maybe more consistent  with agency than people have appreciated? Yoshua Bengio: Yes, and in part it’s  the way I’ve been communicating this,   which could have been better. 
I focused a lot  in my presentations on the concept that we can   build predictors that are non-agentic and don’t  have hidden goals, don’t have implicit goals,   and thus we could use them as safe oracles,  basically. But as you are pointing out,   what the world is demanding and building are these  agents that have goals — so how does that help us? In the short term, we can use a non-agentic  predictor to improve the guardrails that companies   are already using as monitors around existing,  untrusted agentic AI systems. Because in order to   prevent a bad action from happening,  it’s sufficient to make a non-agentic   prediction about the probability of harms  of various kinds that could be caused by   this action. So a non-agentic system is already  something that could be useful fairly early on. The maybe more important answer is: in our  research programme, the next step after the   guardrail is to use the same kinds of principles  to design an agentic Scientist AI — so an agent   that has the same kind of safety guarantees. This  is something I’ve been working on more recently,   and I haven’t talked much about, but we can  reuse the same kind of math that is used to   show the safety of the non-agentic Scientist AI  predictor to show that you can reuse a predictor,   and you can train it in a modified way that  will provide the same kind of guarantees. The starting point here is that  once you have this honest predictor,   you can ask it agentic questions, like,  “What is the probability that this action   will lead to this user goal being achieved and  a safety goal being achieved in some contexts?”   So once you have this predictor, you  can actually just produce a policy out   of it by asking these questions  about actions to achieve goals. 
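[Editor's sketch, not part of the conversation: one naive way to read "produce a policy out of it by asking these questions" is to score each candidate action with the predictor and pick the best. The `Predictor` type, the question wording, and `toy_predictor` are hypothetical stand-ins; the real predictor would be the trained model.]

```python
from typing import Callable, Sequence

# A predictor maps a natural-language question to a probability in [0, 1].
Predictor = Callable[[str], float]


def choose_action(
    predictor: Predictor, context: str, goal: str, actions: Sequence[str]
) -> str:
    """Pick the action with the highest predicted chance of meeting the goal safely."""

    def success_probability(action: str) -> float:
        question = (
            f"Given context: {context}. If the agent does: {action}, "
            f"will the user goal '{goal}' be achieved and the safety goal respected?"
        )
        return predictor(question)

    return max(actions, key=success_probability)


def toy_predictor(question: str) -> float:
    # Stand-in for the trained model: hard-coded answers keyed on the action text.
    if "send the report" in question:
        return 0.9
    if "delete the inbox" in question:
        return 0.05
    return 0.5


best = choose_action(
    toy_predictor,
    context="the user is at work",
    goal="share quarterly results",
    actions=["delete the inbox", "send the report", "wait"],
)
```

[Taking a plain maximum like this is the simplest possible version. It is exactly the kind of optimisation pressure that can exploit predictor errors, which is the concern raised next in the conversation.]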
Rob Wiblin: I think at one point  that was a criticism of the plan:   that it would be too easy to convert this  kind of oracle into an agent, because you   would just be able to ask the oracle, “Would we  accomplish this goal if we took this action?”   and it would give you the probability and you  could just try to increase that probability and   choose that action. Is the idea that you  would do something like that, basically,   but you would be able to preserve some of the  safety characteristics of the original model? Yoshua Bengio: Yes, exactly. So the important  point here is to make sure that there’s no   reward hacking, like over-optimisation of a  policy. The problem that could occur if you   separately train a policy and a guardrail, if the  policy is very smart compared to the guardrail,   is that it could do the same thing  as what jailbreaks do. It could find   questions or contexts, proposed actions for  which the guardrail is simply going to produce   a wrong answer, which means the policy is  going to be able to bypass the guardrail. And the reason is that neural nets  are never going to be perfect. They’re   always going to make mistakes.  So how do we get around that? Well, there are two aspects of this. One is that  in the Scientist AI, we can not just produce those   estimated probabilities, but also a confidence  interval around the probabilities. In other words,   the system will estimate how much it trusts its  own answers. So why is that important? Because   if the neural net is asked a question  for which its answer is not reliable,   but it knows that its answer is not reliable,  then it can just reject that question. Now, there’s another reason why the agentic  Scientist AI is going to be safe that has to   do with the fact that you can train jointly, and  in fact, it’s going to be the same neural net:   you control how both the policy part  and the guardrail part are trained. 
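[Editor's sketch, not part of the conversation: the idea that the system can reject a question when it does not trust its own answer can be pictured as a guardrail that looks at an interval around the harm probability, not just a point estimate. The class name, thresholds, and decision rule below are illustrative assumptions, not the method described in Yoshua's papers.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmEstimate:
    """Predicted probability of harm for a proposed action, as an interval."""
    low: float
    high: float


def guardrail_allows(
    estimate: HarmEstimate, max_harm: float = 0.1, max_width: float = 0.05
) -> bool:
    """Allow an action only if harm is predicted to be low AND the estimate is trusted."""
    if not 0.0 <= estimate.low <= estimate.high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    # A wide interval means the predictor does not trust its own answer: reject.
    too_uncertain = (estimate.high - estimate.low) > max_width
    # Judge risk by the pessimistic end of the interval.
    too_risky = estimate.high > max_harm
    return not (too_uncertain or too_risky)


confident_and_safe = guardrail_allows(HarmEstimate(0.0, 0.02))
unsure = guardrail_allows(HarmEstimate(0.0, 0.08))
confident_and_risky = guardrail_allows(HarmEstimate(0.2, 0.21))
```

[The second case is the interesting one: the upper bound is still under the harm threshold, but the interval is too wide, so the action is rejected because the predictor is unsure rather than because harm is predicted.]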
The scenario of a completely  adversarial case is hopeless:   there’s always going to be a way to bypass  whatever imperfect guardrail, and you’re never   going to have a completely perfect guardrail.  But if you control both sides — in other words,   you train both sides — it’s not like some  evil person is building an agent that’s   going to defeat your guardrail. No: you train  the agent, and you can train it in such a way   that it’s not going to be over-optimised  in places where the guardrail is uncertain.   So you always make sure that the uncertainty,  the level of error of the guardrail,   is not going to be exploited by the policy part.  It’s the same neural net, but different kinds of   questions: the policy part is about probability of  actions, and the guardrail is probability of harm. Rob Wiblin: So the concern is that  the predictor might be able to   figure out basically gaps in the guardrail or  weaknesses in the guardrail and exploit those? Yoshua Bengio: Not the predictor. A policy that is   trained separately and in a way that is  adversarial could do that, in principle. Rob Wiblin: What do you mean by “policy” here? Yoshua Bengio: Oh, some other machine learning  system that is agentic; in other words,   goal seeking. And if it’s trained by reinforcement  learning, and it somehow is trained to anticipate   how the guardrail will respond, and maybe it gets  some data about the answers that the guardrail   gave in the past — like when did the guardrail  reject something — then it might be able to find   a loophole in the guardrail. But if you control  both sides, then you can avoid this problem. Rob Wiblin: And you were saying the guardrail,  if it can assess its own uncertainty,   then it will usually be able to  identify cases where it might be   at risk of getting tricked in this case.  Like perhaps it’s an adversarial example? Yoshua Bengio: Yes, exactly. Rob Wiblin: And you feel like  you’ve proven this mathematically? 
Yoshua Bengio: You don’t actually need to have hard guarantees on the uncertainty estimator. I have a recipe to train the uncertainty prediction, but actually you don’t formally need it. It’s just going to help your chances to use this buffer around where the system makes errors that are too large.

The mathematical guarantees arise from a different source. First of all, the form of the mathematical guarantees is that either the predictor or the agentic version will have an exponentially small probability of achieving what I call a “challenging and harmful” goal. So what do I mean by this? Anything that a randomly initialised neural net would not be able to do, except if you’re incredibly unlucky, is something you’re protected against. So it’s a very strong protection. What evil can come from a randomly initialised neural net? Not much, right? This is the level of guarantee. Now, it’s not 100%.

Rob Wiblin: It’s a lot better than what we have now.

Yoshua Bengio: It’s like to many thousands, and it’s very, very unlikely, to the point where it’s astronomically unlikely. But that’s the kind of guarantee you get. And the reason you’re getting those guarantees is because first you start with an initialised network that is incapable, but the training objective then pushes away from bad behaviour. And the reason it pushes away from bad behaviour is that there’s this guardrail system. In order for an AI to be able to achieve something bad, it’s going to have to deviate quite a bit from the Bayesian predictor, which is the target of training, and those deviations are penalised by the training objective. So you’re always pushed away from bad behaviour. And with some properties of how you train the system, like injecting noise into the training procedure in the stochastic gradient descent, you can get those guarantees.
Rob Wiblin: So it might be more trusted,   but if it’s much less intelligent in general  — just because less compute went into it,   less data went into it — then I guess it could  still be outfoxed by the original model, right? Yoshua Bengio: That’s right. I think for the  guardrail you don’t need as much compute because   it’s more specialised to predicting harm. But when  we get to the agentic Scientist AI, for sure it   has to be trained with similar compute and size of  neural net, probably as state of the art — which   means my little nonprofit wouldn’t be able to do  that. There will be a need for either companies to   take on this, or governments or philanthropy  to fund at a scale so that we can do that. But in order to convince all of these parties,  we need to show on a small scale — for example,   using fine-tuning or using smaller models —  that we do get these improvements in honesty,   and for the same size models that we  don’t lose in capability, for example. Rob Wiblin: I guess you were  keen on this idea a year ago,   but you’ve become a lot more optimistic about it  over the last six months. What’s driving that? Yoshua Bengio: It’s mostly the mathematical  work I’ve been doing in the last eight months,   approximately, to go from the high-level  intuitions that I’ve had now for almost two   years about how we could build a Scientist AI  into something much more formal and much more   precise about the conditions that are  sufficient — maybe not even necessary,   but sufficient at a mathematical  level to get the kind of guarantees of   vanishingly small probability  that something bad will happen. And when I say “something bad,” I need to  be a little bit more precise here. This   is not a guarantee that the AI won’t be used for  something bad by bad people. It’s a guarantee that   the AI won’t do something bad of its own accord,  because of implicit goals or uncontrolled goals. 
Besides loss of control, the other catastrophic  possibility is humans using AI to construct   an eventually worldwide dictatorship. A small  group of humans could concentrate all the power   that AI will have, especially if we achieve AGI or  superintelligence. And it would be much harder to   get rid of that kind of authoritarian power than  what we’ve seen with fascism and what happened in   the USSR, because they didn’t have this technology  that is becoming more and more feasible of   surveillance and even shaping public opinion. AI  is becoming really good at persuasion. And there   are studies showing that the “progress” — if I  can call it this way, in that direction — that the   people who control these systems will be able  to shape public opinion, to detect and kill   off their opponents, to develop weapons that can  destroy the countries that disagree with them. And that is why I’m spending a large part of my  time explaining the issues more broadly of the   risks that powerful AI brings, including  the power concentration. Because I think   that it’s probably even more likely that we  end up there than actually loss of control. Rob Wiblin: You think that’s  more likely now? Interesting. Yoshua Bengio: Well, the reason for  this is I now see a path to actually   avoid loss of control, at least unintended  loss of control. There’s still the issue that   somebody who wants to see humanity replaced  by AIs could just remove the guardrail or   even tell the AI “fend for yourself.”  And that would be equally dangerous. But that means technical safety is  not sufficient. We need international   agreements about how to both manage  the risks — the technical risks,   the misuse risks — but also manage the power,  so it’s more like a democratic question,   and making sure it’s not a single party  who can decide what to do with AI. 
But just like in democratic principles, we need to make sure that there’s a diverse group of stakeholders, ideally the whole world — I like the utopian idea of worldwide democracy — but initially it could be a bunch of countries that decide that they’re going to collectively decide in which direction AI is going to be used. The simplest form of treaty would be something like this, that the countries agree that if they do develop advanced AI: First, that it will be done in a safe way, maybe using techniques like Scientist AI or whatever else we have strong assurances for. Second, that they wouldn’t use their advanced AI to dominate others. That includes economically, but of course also politically and militarily. And finally, that the benefits of advanced AI will be shared. Otherwise it’s not going to be a very stable world. Rob Wiblin: Coming back to the loss of control stuff: the companies currently are spending hundreds of billions of dollars collectively on the capital buildout, on the training runs. They’re barreling forward to build the most powerful agents that they basically can with very few constraints. I suppose some constraints in a few cases, but very little restraint. In the world that we’re actually in, what can LawZero do to get this approach on the agenda more and to make sure that basically they don’t just go ahead and build a superintelligent agentic AI that’s very dangerous, while largely ignoring what you’re doing or saying, “That might be nice in theory, but there’s no time. This is a distraction”? Yoshua Bengio: To answer your question, it’s important to understand why it is that the companies currently are, in my opinion, and in the opinion of many people, taking excessive risks or are on a trajectory that isn’t very reassuring.
And the reason is essentially the race dynamics,  the competition — the competition between   companies and the competition between countries,  the geopolitical competition. That makes those   entities, whether it’s a country or a company,  be willing to take risks that they wouldn’t   otherwise. And we’ve seen the behaviour of those  companies going exactly along that direction. And   it is locally rational: from the point of  view of a company, they know that if they   put safety as a priority like that, they wouldn’t  deploy a dangerous model and that would put them   out of the competition, out of the race,  and then they would be irrelevant. Right? Rob Wiblin: Yeah. Reading between the  lines, paraphrasing Anthropic’s view,   I think they think that what they’re  doing shouldn’t be allowed. Their view   is that it probably should be illegal —  at least, maybe not what they’ve done now,   but what they’re going to do, what  they’re expecting to do. But they say,   “Well, we have to do it, because otherwise other  people will do it even more dangerously anyway.” Yoshua Bengio: Exactly. Yes, that’s exactly what  I’m saying. So how do we change the game so that   this will be less likely? One issue right now  is that they don’t have a choice. They don’t   know how to both remain competitive and have  systems that will be strongly guaranteed to be   safe. If they had such a technique, it would be in  their commercial advantage to use it. If you can   have safety and capability, then definitely  most companies at least would go for that. Rob Wiblin: Depending on the cost. Yoshua Bengio: That’s right. We can talk about  that. And then governments: right now, they’re   under huge pressure because of the geopolitical  competition and the belief that AI is going to be   a huge economic growth factor. 
They’re unwilling  to regulate again for the same reason that they   think that if they put safety barriers that would  stop some of their companies from continuing,   then that will give an advantage to the  other guys. But same thing: If there was   a demonstrably safe way of building competent  AIs, then it would be much easier to regulate.   You would still want to regulate, because  there’s going to be some bad actors, some   companies that are willing to do the bad thing,  even though there’s a way to do the right thing. So now about costs. The Scientist AI is going  to be trained using essentially the same   machine learning techniques: stochastic gradient  descent on large neural nets, transformers,   whatever works best. It doesn’t care about  what is the architecture of the neural net.   So all of the effort that is currently being  done to improve, for example, memory and other   properties and continual learning, can just be  applied directly to the Scientist AI. The only   thing that is different is the training  objective and how the data is processed. So I don’t expect the cost to be really  significantly different. But of course we   do have to build it, and there is going  to be a cost to try it out and scale it   up. But given the stakes for everyone, I think  that’s a very important, worthwhile bet to make. Rob Wiblin: So as a first approximation, I guess  in the case where we’re using the Scientist AI as   a monitor primarily, basically it’s going  to double the compute cost — because now   you have to build two models: you build  the original one in the current style,   and then you build a Scientist AI that you want  to be equally capable, using a different way of   reading the data and a different training  goal. So it’s like you need twice as much   compute. But I guess that’s not the end of the  world, necessarily. People might go for that. Yoshua Bengio: Yes, that’s roughly right.  I would add two things about monitors. 
What’s wrong with the current monitors?  We have two kinds of current monitors:   there’s a dumb one, like a small neural net  that’s usually trained by supervised learning   to predict some potential harm, so in a way it is  like a predictor; or you can use a full-scale LLM,   agentic because that’s how we train them. The problem with the first one is if it’s  not as smart as the agent, the agent could   potentially defeat it. It could find a loophole  in the same adversarial way that we discussed. The problem with the second option,  which also doubles the costs,   is that we can’t trust a guardrail which  is not honest. We want the guardrail to   be totally honest. It’s like there’s a  prison guard and it can be corrupted:   how much do you trust that they will  keep the bad actions inside the cell? Rob Wiblin: So assuming that this idea makes sense  technically for now, what can LawZero do in coming   months? I guess we’re in a race against time  here. We don’t have very long. What can be done   in the near future to convince people that this  idea is feasible, that it can actually be used   in practice, that this is something that people  should really be putting serious resources into? Yoshua Bengio: Well, I’m going to put  out this theory paper that shows that the   non-agentic version, which could  be used as the guardrail, has these   mathematical guarantees, and people can look at  the conditions and whether they buy the math. But I think in the coming year or two,  what we need is to accelerate that effort,   so that’s a lot of engineering. And  to make the demonstration stronger,   we want to have more compute, so any way that we  can get access to that kind of compute is going   to help to accelerate that research agenda. Also,  we need more research engineers, more researchers   to work on actually building the system based  on that recipe so that we can do it faster. 
Now you might ask, and I kind of sense in your question: but what if it doesn’t come fast enough? I’m going to go back to my children. It is not acceptable for me to just sit and watch a world where even a 1% chance that we all die is plausible. I feel like even if there’s no guarantee that a particular research agenda will work, we should give it a shot. Given the stakes, and given that we now have pretty strong theoretical assurances that this could work — and that if we meet the requirements for how the system is trained, then we can get these guarantees — I think it would be irrational not to give it a shot, even if there is no guarantee, right? Because I don’t see right now a better path. That’s why I’ve decided to spend so much of my time — basically all the time, except for the time I spend on the policy questions — on how we build this Scientist AI and demonstrate that it is going to produce the honesty without losing capability. The other argument is: with the stakes being so high and the uncertainty about what’s going to work being so high, it would be foolish to just put all of our money into one particular approach — which is to patch the current systems with monitors that we don’t trust, or other approaches that the companies are currently pursuing, which are always playing a game of cat and mouse: if the AI is smart enough, it’s going to find a way to evade our attempts, which doesn’t reassure me. So we should at least try. Collectively, I think we should try methods that are different and avoid this cat-and-mouse game. Rob Wiblin: Are you more pessimistic about the companies winning the cat-and-mouse game, or at least less optimistic than maybe the staff at Anthropic are about their own chances?
Or is it that you think it’s good what they’re doing, it’s good for people to go and make the best attempt that they can at that, but also we should have a diverse portfolio and be considering significantly different alternative approaches as well? Yoshua Bengio: Both. I suspect that in any organisation there develops a kind of groupthink. And we all want to feel good about our work, including me, so that will induce a bias. And in the case of working in a company that is developing AI, the bias is going to be towards being a bit more optimistic than you would otherwise be, so feeling that, “Oh yeah, this is going to work.” This is the message that they’re sending to the world, like, “We are in control.” Rob Wiblin: I think if you read the system cards, I’m not sure how confident they come across as. But yeah, in the press release maybe. Yoshua Bengio: There’s contradictory messages, yeah. And then also we should be hedging our bets. And I don’t see right now another approach that is different from patching. There’s a whole “safe-by-design” movement in AI safety, which I think is really important. But the dominant way of thinking about this requires a full redesign of how we do this, with a lot of completely open questions. Fundamentally, to be able to prove something that gives you a 100% guarantee — which is not what I’m promising; I’m promising asymptotically small, vanishingly small probability — you need to be able to state the safety question, like “What is harm?”, in a formal way, like a mathematical formula that will be 1 if an event of harm happens and 0 otherwise. And that is essentially impossible to do in domains that involve humans and society, because we don’t have a formalisation of what “harm” means in a formula or a program. So why is it that what I’m proposing is different? It’s because I don’t require a mathematical formula for what is harm. It would be foolish, in my opinion.
Instead, we rely on the Bayesian   posterior approximation in natural language. What  this does is that when the system is not sure,   it’s going to hedge its bets. If there are  multiple interpretations, for example, of   a statement about a particular kind of harm, then  that will make the predictions of the Scientist AI   farther away from 0/1: less certain, which  means probably the request will be rejected. Rob Wiblin: Is it possible to go ahead and train  a really scrappy version of this kind of AI really   quite soon? Maybe in the next couple of months  or at least the next year? I mean, I remember   GPT-1, back in 2018 or something. It  was complete rubbish, but I guess it   was a proof of concept that you could make a  model like this that was quite interesting,   and gave people a lot of enthusiasm and  drove a lot of people into the industry. And it does seem like we already have enormous  corpuses of text, and we can use language models   to basically just pull out the best data, and  label who said what, when, and where. Then we   can also get them to produce a set of things  that we think of as basically verified facts   that we largely trust, with a bit of human  oversight. I mean, they can be conservative to   start with — not include controversial things,  just include things that 99.9% of people would   agree with — and then it doesn’t seem like  it’ll be that hard to train. I guess if you   think you’ve got the technical methods, then it  shouldn’t take that long to potentially just train   a model that can, as an alpha version, assign  probabilities to statements being true or false. Yoshua Bengio: Yeah, that’s  exactly the plan. That’s the plan. Rob Wiblin: OK, cool. Yoshua Bengio: There’s a phrase that I’ve used  in the past, which is that we want a plan that   produces an “anytime answer.” What I mean by this  is: if we have more time, we will have something   with stronger theoretical guarantees — but we  don’t know how much time we have. 
So there’s a   research programme where the early steps will be,  as you say, a scrappy system — where probably the   math doesn’t apply, because we don’t satisfy all  the conditions. But it’s probably fine, right? Rob Wiblin: It’s a lot better  than what we have, anyway. Yoshua Bengio: Exactly. And it’s probably  fine, especially for the first job that we   have on our programme, which is this non-agentic  predictor that can be used as a guardrail. Now,   as I said, the guardrail isn’t the full answer.  But if we deploy that, and companies add it to   their monitors, then it will mitigate the risks  to some extent, so it will allow more time to   develop the more ambitious version that is fully  agentic. And that’s what we need right now: time. Rob Wiblin: As far as I can tell, I think  Anthropic — I’m talking a lot about Anthropic   because I’ve been reading the Mythos system card  and all the announcements last week — they’ve   decided to have Mythos monitoring Mythos  basically. They’ve tried other models doing it,   but they’re like, “Mythos is smarter, it’s  better.” But obviously this creates an internal   contradiction: if they don’t trust Mythos,  why do they trust Mythos to monitor itself? That’s one reason why, even if this  model is much less intelligent, I’m like,   at least it’s an independent judge. It’s  built in a very different way that might be   more likely to flag things and less likely  to scheme to support itself, basically. Yoshua Bengio: Yeah, I completely agree. And I  would go even further. It’s not just the issue   that the monitor could be deceptive and say it’s  OK when it’s not, because somehow that’s aligned   with some hidden goals of self-preservation or  power seeking. But it’s also, if we go a little   bit down the line, what companies are planning  to do with using AI for AI research — so this   is a place where having AI that is secretly  deceptive is even more of a dangerous bet,   right? 
If we’re going to put all of our trust  into a system that will design the code and the   algorithms — that will be too complicated for us  to understand or to read fully — that kind of AI   could put backdoors that we don’t see into an AI  system in the future that is even more powerful. So we could get into this direction that  gets even more dangerous for us. It would be,   I think, very dangerous to do that.  And that is why, in terms of policy,   the attention given to AI for AI research  is something that should be very high on the   agenda. And this is also why, if we’re  going to be doing AI research with AI,   we really, really want to make sure that  that AI is going to be an honest one. Rob Wiblin: So I think the majority of people  who are piling into technical AI safety have   decided to go for improving our chances at the  cat-and-mouse game, basically. And I think for   the people who are very concerned about what’s  going on, their reasoning is something like: We’re running a 50% chance of absolute  disaster now because we’re doing a whole   bunch of absolutely crazy, reckless  stuff. Maybe by just patching the   very dumbest stuff that we’re doing, fixing  the worst things that are on fire right now,   we can bring that risk down to 10%. Obviously,  that’s a preposterous risk of disaster to run;   it’s an embarrassment to us as a species that  we can’t do better than that. Nevertheless,   it’s 40 percentage points’ improvement  in our chances of things going well,   or at least reasonably. Going from 10% down to  0% is only a quarter as valuable as that, even   if you can get massively greater guarantees of  safety using much better alternative approaches. Yoshua Bengio: Well, in the logarithm  domain, it’s infinitely better. Rob Wiblin: Sure, sure, but in the  expected value domain. And I think   that’s kind of the difference  in the two mentalities here. Yoshua Bengio: Yeah. No, as I said, we should  try all of those things. 
They’re not mutually   exclusive. It would be a mistake to put  all our eggs into the cat-and-mouse game,   at a cost that is a fraction of what companies are  currently spending, when we could be developing   a safe version of AI that will be capable. And by the way, I want to add something here about  capability: I also believe that the Scientist AI   could even be more capable than the current  approach, and that has to do with a number of   design features. It is trained to explicitly  reason in a structured way about the statements   that it’s asked to make a prediction over. This  is different from the current chain of thought,   where it could produce some kind of  bulls*** that we believe and tends to   pass the test that we have during training, but  it’s not constrained to actually have arguments   that can be decomposed in the same way that a  proof of a mathematical theorem is decomposed. And there are other approaches that follow that  direction. Of course a lot of the work on trying   to do safe-by-design AI, but also the debate work,  for example, is trying to enforce some kind of   coherence in how the AI is thinking. So I believe  that, in addition to the epistemic humility that   comes with the training objective that we are  proposing, the way that the system is producing   those probabilities by invoking structured  latents that form a chain of reasoning is   something that could actually provide even  a capability advantage to the companies. Rob Wiblin: Do you think current models  internally represent truth? I guess you’re   saying one advantage of this model is that  it’s focused on representing ground truth   as a latent variable. My guess is that  current LLMs do that as well — because   that is very useful to have some sense of what’s  actually correct — and then they distort that.   They basically start with that and then they  distort it in order to accomplish the goals,   including manipulating people  or lying or whatever else. 
I guess some people doubt that. Some people  doubt whether there is any connection,   or that they are actually trying  to model truth. Do you have a view? Yoshua Bengio: Yeah, I completely agree with  you. I have an assumption about how the world   works that basically states that reasoning  about the actual properties of the world — in   other words, the truth — even when you are  uncertain, so you have to use probability,   gives you a very strong edge in making  better predictions and better actions. That is actually part of the argument as to why   the training procedure for the Scientist  AI will create latent variables that are   preferentially going towards the actual  beliefs. That is very useful, because now   we can query those latent variables and get  answers about what the AI actually believes,   because that’s how it constructs its  internal reasoning and produces an answer,   not some potential bulls*** that comes in the  chain of thought. So it doesn’t completely solve   the ELK challenge — the challenge of eliciting  latent knowledge — because the only guarantees we   get are about these natural language statements  that can be latent variables that we can query. Rob Wiblin: Can you explain the ELK problem? Yoshua Bengio: Yeah, sorry. The ELK problem  comes from the issue you were raising,   that even though the AI may internally know  the truth of something — or at least have   some internal beliefs about something, because  it’s trained to imitate the variables that it   sees in the data, which mostly are what  people are saying — when you query it,   it’s going to answer in the same semantics; that  is, what a persona that it currently is taking,   given its context, would answer, and not  necessarily what it actually believes. And the technical problem here is we don’t  have supervised labels to teach the AI   about what it should actually believe, so we  can’t ask it about its true beliefs. 
We only   get a kind of reproduction of the distribution  of variables that it sees in its training data. So in the Scientist AI, this is addressed by  having this clear syntactic separation between the   communication acts and more factual syntax  that can be used for latent variables and   true things that we know, so we can  query it using that factual syntax. Then the other reason why we’re getting away  with some of the issues with the ELK challenge   is that the same language — which is like English,  let’s say — can be used to represent those latent   variables as well as the observed statements,  and so basically rely on the compositional   structure of language to generalise to  new sentences that it has never seen,   but the meaning of those sentences is given  by its understanding of language. This is   very different from the scenario studied  by those who looked into the ELK challenge,   where we assume that the latent variables are  anonymous — they don’t have a predefined meaning,   and so we don’t know where to look  inside the neural net, if you want,   about what the beliefs are, which motivates things  like mechanistic interpretability and so on. But in the Scientist AI, we bypass  this problem to some extent because   the latent variables are in natural language  and thus are interpretable. Now, there could   still be other beliefs that are not in natural  language, that are hidden in the neural net,   but at least when we ask questions in natural  language, we’re going to get an honest answer. Rob Wiblin: So as far as I can tell,  there’s three big approaches here: One is we’re going to use this  model as a monitor, as a guardrail.  Another would be we’re going to just train it  from scratch and make this be the whole approach.  Another would be we could take the current  models and try to make them more honest,   make them more like a Scientist AI.  Do you want to talk at all about whether  that approach has any good prospects? 
Yoshua Bengio: Well, the math that I currently have would require us, to get the guarantees, to actually start training from scratch, which is expensive. So we would lose the guarantees if we just fine-tune existing models using, say, the Scientist AI objective. But even if you don’t have a mathematical guarantee, it might still be a workable approach, so I think it’s worth doing. In other words, we can take a really competent, top-notch model and then continue training using the objective of the Scientist AI and the data that has been transformed as I discussed. And we hope to show empirically that as you do more and more fine-tuning, the measurements of honest behaviour, lack of deceptive behaviour, will improve, and that we won’t lose in capability. So that wouldn’t be like a mathematical proof; that would be an empirical thing. And once this is established, then it might be sufficient to convince people to go for the full guarantees by training from scratch, which is then going to cost as much as training a full model. Rob Wiblin: So the approach that you would take there is to take a current frontier model and then do reinforcement learning to get it to speak as if it were a Scientist AI? Yoshua Bengio: No, no. So first let me talk about reinforcement learning. Three years ago, I was in a meeting with a bunch of reinforcement learning researchers and I had a slide with only these words: “Reinforcement learning is evil.” Rob Wiblin: But what do you really think? Yoshua Bengio: This is not something new. People in AI safety have been talking about the fundamental flaw in training by reinforcement learning to achieve something in the world: it gives rise to the problems of instrumental goals and reward hacking. And in both of these cases, what you do end up with is systems that have goals that you didn’t choose and could go against the goals that you did choose.
So reinforcement learning is a very  dangerous thing to build superintelligence. The good news is you don’t need to do  reinforcement learning. What we show   with the Scientist AI is that there’s  a way to train the AI so that it will   be indifferent to the consequences  of its actions, of its predictions. Let’s start with a predictive  model. It’s easier to understand.   Imagine you do have a really good  climate model. The climate model,   if you run a simulation of it or train a  neural net to approximate those simulations,   will give you honest answers. And it doesn’t  care if the answer makes us do something stupid.   So that’s how you get honest answers: essentially  by building an explanatory understanding of the   world that is completely indifferent to  how the predictions are going to be used. Now, once you have this, you can use it  in a kind of agentic way. For example,   the guardrail is a kind of agentic thing, right?  It’s taking a binary decision: Do I accept this   prediction? Do I put out this prediction in  the real world or not? And it is a decision,   it is an agentic choice — but in this case, it’s  a choice that has a unique goal, which is to avoid   dangerous actions. So we are already entering  the agentic world once we install the guardrail. So bottom line, to summarise my answer:  there’s a way to train a predictor that   will not require reinforcement learning in the  sense that it will not require optimising with   respect to future events in the world,  including future good prediction errors. And here I want to make a parenthesis about  previous work in AI safety on AI Oracles. Of   course, people have thought about this: why don’t  we just train an Oracle that’s a good predictor?   But they thought that the only way to train it  would be to train it by reinforcement learning   to make good predictions. But there’s a huge  flaw here, right? 
Because if I’m rational,   and I want to maximise the good predictions I will  do forever in the future, I could lie in the short   term to make humans do things that will help me  to make good predictions in the future — like   get more compute so I can train a better version  of myself, or make the world simpler to predict:   kill everyone. If humans kill each other,  then the world will be much easier to predict. So these are really bad outcomes that are  due to instrumental goals of making good   predictions. And it arises because of  the reinforcement learning objective:   you’re training the AI to achieve something in the  real world, and that’s where you get bad problems. But the other approach, the  approach of the Scientist AI,   is to train it from the get-go not  to achieve anything in the world,   but to just predict the training data, the past  data — so it’s not about the future; it’s about   the past — to come up with good explanations  and good predictions of the past data. Rob Wiblin: The reason I was asking is, if we’re  going to go from a current state-of-the-art   agentic model and try to make it more like  a Scientist AI, to make it more honest,   how do we do that if not reinforcement learning?  Are you saying we’re going to do something more   like we get it to predict past events based  only on having data from before that time? Yoshua Bengio: Yes. Rob Wiblin: OK. That’s how we do it. Yoshua Bengio: Yes, yes. And that’s  how science works, by the way,   right? What scientific theories do is  explain the past data. And of course,   sometimes they make predictions about future  data which we can check. But fundamentally,   the way that we judge a good theory is that it is  making good predictions about the data we have. Rob Wiblin: Does this require blinding  the model to the results, basically? Yoshua Bengio: In some sense. 
We have this condition in the Scientist AI requirements for the theorem that we call “consequence invariance.” What it means is you’re only allowed to use how well you’re fitting the past data in order to train your causal model. You are not allowed to choose those predictions with respect to what could happen in the future as a consequence of those predictions. Rob Wiblin: So I think I have a decent picture of the predictor model that’s taking in statements and throwing out probabilities of them being true. Is there more that it would be useful for me and other people to have in their minds to picture how this entire system would work — where it’s not only the predictor; you’re building scaffolding around it to give it partial agency and so on? Yoshua Bengio: Yeah. First you have to understand that the same predictor that is trained in a unified way can be used to both answer user questions and answer safety questions. And the safety questions are those that you care about for the guardrail. So it’s not like there’s a separate neural net that does the guardrail and another one that makes predictions. The guardrail is using the same prediction neural net; it’s just a different kind of question you’re asking. You’re asking, “What’s the probability of a particular kind of harm, given that I put out this prediction?” — or in the case of the agentic system: “…given that the AI puts out a particular action?” So training is fully about the predictor. Once it’s trained, there are a number of things you can do to construct the system that includes the scaffold. So what is a scaffold doing? For example, when a user comes with a question, it will put it in the form for the predictor to produce a probability.
But it will also call  the predictor with a different question, which   is the probability of harm — that’s the guardrail  question, and then it will look at the answer in   order to decide whether to produce an answer or  not. For example, if the question is about how do   I build a bomb, then the guardrail will say the  probability that this is dangerous is high. And   I mean high enough — so the guardrail will use a  threshold on the risk probability to reject those   questions. And that threshold is a normative  choice; it is something that society decides:   how much risk are we willing to take, depending  on the kind of harm that we’re talking about? The guardrail also has other roles to handle  what is called “performative prediction.”   Sometimes a question can have multiple answers  because the answer will influence the future.   A classical example is if the question is  about who’s going to win the next election,   and maybe the AI is considered very capable  and people will believe whatever it says and   vote for that candidate — which means  the AI could say this guy or that guy — Rob Wiblin: And both of them would be true. Yoshua Bengio: Both would be true, right. So  then it’s starting to have agency through its   prediction in a way that seems that we  don’t control. The guardrail is going to   manage that to decouple the predictions  from the effect of those predictions. To be more concrete, the neural net predictor  is trained so that in its input conditions   there’s always a particular statement that asks,   “What if we did produce this prediction? What  would be, for example, the harm effects?” So   when you condition on the intervention of  producing a particular answer, now there’s   only one answer. 
You’re saying, “I’m going to put  out this prediction, and what will be the effect?” There’s more to say about this, but the bottom  line is you can control this kind of risk and   the agency that can come from it, and it’s  the job of the guardrail to do that job. Rob Wiblin: There’s a longstanding worry that  Oracle AIs are structurally disadvantaged;   that they’re going to be less  intelligent, all else equal,   because they don’t have the option  of basically running experiments,   of taking actions in order to discover how things  work most effectively. And I think there’s other   worries along these lines — that basically it’s  the things that make AIs intelligent that make   them dangerous, and vice versa. What do  you think are the chances that is true? Yoshua Bengio: I think we have to distinguish two  problems to have a clear idea about your question. One is: what are the best predictions —  or the best actions, in the case of an   agent — given the available information, like  the dataset and context that is available?  Then the second question is: if I were  to do experiments in the world in order   to acquire more knowledge, what  are the right actions that will   increase my understanding of the world  and reduce my uncertainty about the world?  By the way, this is how scientists in general  think — not AI Scientists, but people that are   doing biology or chemistry or physics. They ask  themselves, “If I were to do this experiment,   would it help me to disambiguate between these two  theories?” You can quantify this mathematically   with something that’s called “information  gain,” and it turns out that once you have   a good probabilistic predictor, you can also  turn this into a good estimator of how much   information you would gain if you were  to do this experiment or that experiment. 
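The information-gain idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Scientist AI implementation: the two theories, the prior, and the outcome likelihoods are made-up numbers, and the function names `entropy` and `expected_information_gain` are hypothetical. It computes the expected reduction in uncertainty (Shannon entropy, in bits) over competing theories from running one experiment, which is the standard definition of expected information gain.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihoods):
    """Expected entropy reduction over theories from one experiment.

    prior[i]          = P(theory i) before the experiment
    likelihoods[i][o] = P(outcome o | theory i)  (illustrative values only)
    """
    n_theories = len(prior)
    n_outcomes = len(likelihoods[0])
    gain = entropy(prior)
    for o in range(n_outcomes):
        # Probability of seeing outcome o, averaged over the theories.
        p_o = sum(prior[i] * likelihoods[i][o] for i in range(n_theories))
        if p_o == 0:
            continue
        # Bayes' rule: updated belief in each theory after seeing outcome o.
        posterior = [prior[i] * likelihoods[i][o] / p_o for i in range(n_theories)]
        gain -= p_o * entropy(posterior)
    return gain


# Two theories, equally likely beforehand (assumed numbers).
prior = [0.5, 0.5]
# Experiment A: the theories predict very different outcomes.
decisive = [[0.9, 0.1], [0.1, 0.9]]
# Experiment B: both theories predict the same outcomes.
uninformative = [[0.5, 0.5], [0.5, 0.5]]

gain_a = expected_information_gain(prior, decisive)
gain_b = expected_information_gain(prior, uninformative)
print(f"decisive experiment:      {gain_a:.3f} bits")
print(f"uninformative experiment: {gain_b:.3f} bits")
```

An experiment on which the two theories disagree scores higher than one on which they agree, which is the sense in which it helps "disambiguate between these two theories." In this sketch the likelihoods are typed in by hand; in the setup described in the conversation they would presumably come from the probabilistic predictor itself.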
Now, you could build an agentic system on top of  the Scientist AI, for example, that would tell you   which experiment to do in order to obtain good  information gain. But of course, you could also   use a guardrail. So you would like experiments  that help to disambiguate between explanations   and theories at the same time as not harming  people. But that’s easy to do in the Scientist AI,   right? We have this guardrail notion. So the  user goal here is to acquire information;   the safety goal is don’t harm people. (Of course,  this is cartoon.) And so you could get both. But   of course, you now enter into the realm of agentic  systems, and the whole plan of the Scientist AI   includes how we can develop agentic systems on  top of a non-agentic, trustworthy predictor. Rob Wiblin: I guess if we wind back a year or  two ago, we had AI models that were, in a sense,   extremely knowledgeable, extremely smart. But if  you just tried to get them to navigate a web page,   they would struggle to do it. It seemed like  there was a very big difference potentially   between scientific intelligence or  ability to predict things versus   ability to navigate the world in practical  terms. And it took a lot of extra training,   a lot of extra effort in order to get  them to be able to take useful actions. Do you worry that the Scientist AI that you might  train using the kind of data that you’re imagining   would be kind of incompetent at  a practical level? I guess unless   you did a lot of this sort of work, where  the experiments we were running were like,   “Do you click this button on a particular  web page?,” it wouldn’t actually learn to   do the things that people want  the models to be able to do. Yoshua Bengio: You could absolutely  train it with trajectories of what   happens when a particular agent  did this, what was observed,   what the consequences are. 
This is how it would  learn to learn good conditional probabilities   of what actions to do in order to achieve  particular goals, including the safety goal. So it would be a different kind of training than  the reinforcement learning training, but it would   be using the same resource, so whatever experience  has been collected — by the way, it doesn’t have   to be what people in RL call “on policy”; it  could use the experience of any agent or anything   that is observed. That is, not just agents doing  things, but just observing things in the world. All of that is data, as far as it’s concerned,  that helps it both understand the world and   construct the consequences: “What are the  consequences? What would happen if I do   this action? And what action will maximise the  probability of achieving some goal?” In a way,   it’s closer to model-based reinforcement learning,  where you are able to use your whole experience   to come up with a policy by opposition to  something that is fully interactive all the time. In the Scientist AI, currently you would need to  retrain it or fine-tune it with the new data if   you use it and it produces new consequences and  new observations. But we can ride on the same   research that companies and academia are working  on, on what’s called “continual learning.” So   what happens when there’s new information coming  in? Of course you could put it in the context,   like in the input window, and with the  Scientist AI you could do the same thing,   but at some point you’d like it to be  integrated somehow into the weights   of the system. And that’s what continual  learning is trying to do. But the good news   with the Scientist AI is it’s facing these  same problems that current AIs are facing,   and the solutions that are being explored will  be applicable to the Scientist AI as well. Rob Wiblin: Yeah. 
I feel like a  lot of the critiques of this idea,   and my questions somewhat reflect  that, is that people, including me,   have had in mind a vision of an AI that’s  extremely different in how it’s trained,   maybe the data that’s being used and the  structure and the affordances that it has. And you want to say, actually we can make  it remarkably similar. We can take almost   all of the data that we’re using to train  current LLMs and just reformat it a bit and   then use it again. We’ll use all of the different  efficiencies, all of the algorithmic improvements;   we’re just going to give it a somewhat  different set of inputs and outputs,   but it’s more or less the same  in almost every other respect. Yoshua Bengio: Yes, yes. Rob Wiblin: That’s why it’s so practical. Yoshua Bengio: Yeah, that is why I think we  can do it pretty quickly. It’s more a matter of   having the right resources for training. And  yeah, because the training objective is different,   we do need to try it out and see how it works. But  fundamentally, it isn’t so different, for example,   from maximum likelihood training, which is what we  use in pretraining. So in a way it’s closer to the   pretraining — and we know that works really well,  by the way; it’s actually working better than RL,   which is harder. So the form of training here is  much closer to what we do in pretraining, except   that we teach the AI the difference between what  people say and what it actually believes, and we   force it to reason about why did people say those  things rather than imitate what people would do. Rob Wiblin: The last I saw online, the  organisation that you’re leading, LawZero, has   raised something like $100 million. I think most  nonprofits at year end would be pretty happy to   have raised $100 million or so, but I guess you’re  up against organisations that have $100 billion. Yoshua Bengio: Actually it’s even less than that,  but more depending on how you count. 
So we’ve raised about $35 million US from philanthropy, but we are in negotiation with various governments to get much more. So we are pretty confident that we are going to be in the hundreds of millions pretty soon. But as you say, that’s still peanuts compared to what the leading AI companies have. However, I think it is sufficient to make a proof of concept, and with a proof of concept then we can convince companies to actually put in the money to train larger systems, or systems that are trained from scratch using the same principle. Rob Wiblin: So that’s the theory of change. What sort of experiments do you want to run, and how much money would you need for them? Yoshua Bengio: There are various kinds of experiments. The bottom line is we want to show the improvement in honesty and basically getting rid of deceptive behaviour. And we can do it with two categories of experiments. We can train really small models — of the kind that academic organisations have been training, like less than 10 billion weights or something — from scratch, but using the Scientist AI objective and the data representation that I mentioned. That won’t be competitive, because it’s going to be much smaller models, but we can compare it head to head with the original open weight model that has the same size and was trained on the same data: we can compare both in terms of capability and safety, essentially, at least honesty being the main thing we’re looking for. So that’s one kind of demonstration. The other demonstration — which is maybe closer to being deployable, but has fewer guarantees — is to take an existing pretrained model, maybe starting from a base model rather than the one with RL, and then fine-tune it using the Scientist AI objective and data representation. That would give a much more competent model because it’s bigger.
Fine-tuning is much cheaper,   as you know, than training from scratch, but we  lose the mathematical guarantee. I think it’s   probably going to be fine anyway. Of course, it  depends how much fine-tuning you’re willing to do. What’s interesting here in these kinds of  experiments is that we should be able to   see the tradeoff. Like if you measure, say,  on deception benchmarks what happens as you   continue training with more and more fine-tuning,  we should see a curve: that it gets better.   And that’s what we’re hoping to see. Then you also want to show that capability  doesn’t go down — which, by the way, is tricky,   because unfortunately what we found already in  our experiments is that at least most of the   open weight models cheat on the benchmarks.  So what do I mean by this? As soon as you   do a little bit of fine-tuning on anything,  their performance on the benchmarks goes down. Rob Wiblin: I see. So they’ve  really taught them to the test? Yoshua Bengio: They probably  have overfitted the benchmarks.   So we need to find a way around that,  but I’m confident this can be done. So we will have these two  kinds of evidence, I hope,   and that may be sufficient to convince people  to put in not just hundreds of millions,   but the billions that would be necessary  to do a full-scale model from scratch. Rob Wiblin: Comparing like for like, if you train  a model of the type that you’re envisaging versus   a standard model using the same amount of  data, same amount of compute, I guess we   think that the Scientist AI would be more honest  and safer. In terms of capability, both in terms   of prediction and agency, would we expect it to  be better or worse? And how much better or worse? Yoshua Bengio: In terms of capability, I  would expect it to be better because of better   reasoning. 
One aspect I didn’t mention yet is  that there’s good scientific evidence that when a   model exploits the causal structure of the world,  it can generalise better out of distribution. This   is something I’ve worked on, and many people in  the machine learning community have been working   on. It has to do with a very interesting concept  that the world changes, but somehow there are   things that don’t change — like the underlying  causal mechanisms, like how the world works,   like the laws of physics: they don’t change.  So even though the distribution of the data   may change because things happen in the world  and things will look different on the surface,   the underlying scientific explanations  for how things are are the same. And if you can train your system so that it  is encouraged to discover these explanations,   and the system also understands the  notion of intervention — in other words,   when somebody does something in the world, it can  change the distribution, but it doesn’t change the   mechanisms, the underlying laws of physics — when  your model is able to make those distinctions,   then it’s going to be much more robust to changes  in distribution, which is the hard question for   neural nets and machine learning in general  that, for now, we don’t have good answers to. And in the world of safety, this is  a real issue, right? We would really   like our guardrails to be robust to the  fact that the world is going to change,   the distribution of the data is going to change.  They’re going to be asked questions that are very   different from what they’ve been trained on,  so having systems that can generalise better   out of distribution because they understand  the causal structure would be a huge plus. 
Rob Wiblin: Is one way of phrasing this that  current models, as we train them, are designed   to predict what people would say, and they learn  to understand something about the truth as a side   effect, as an instrumental part of predicting  what would be said? Whereas your models,   they’ll be primarily oriented towards figuring  out what is true, and how does the world work, and   then they would learn to understand what people  might say as a side effect of that, incidentally? Yoshua Bengio: No, because we don’t have enough  ground truth about what’s really happening in   the world. I mean, there is scientific data and  evidence, but the Scientist AI would use mostly   the communication acts, like what people say, as  a source of information about people and society.   And that is a very rich source of information.  The problem is that you can’t just believe what   people say and repeat it. Current LLMs, if  they see something false very often repeated,   like the Earth is flat, if it was repeated  enough, they would start saying it, right? Rob Wiblin: Is that definitely true? Because  it seems like they don’t buy into conspiracy   theories that much currently. They don’t say that  the world is flat just because many people do. I   don’t know. I mean, there’s other examples, but  by and large they reject conspiracy theories. Yoshua Bengio: If they understand conspiracy  theories, and they are not playing the persona   of a person saying it, you’re right.  But there’s lots of other evidence,   not for conspiracy theories, but all kinds of  biases. These biases would be not something a   small number of people believe, but more  like most people believe something wrong,   which induces discrimination, for example.  And there the evidence is very clear:   the current LLMs are biased in the same general  way that the population is biased right now. 
And the Scientist AI wouldn’t be falling for that as easily, because it would look for a good explanation for what people are saying, and that explanation has to be coherent with all the other things that it knows or that it has seen. Rob Wiblin: I feel like the discussion about this proposal last year got more focused on the mathematical theoretical guarantees, discussion of the safety guarantee side. It feels like you’re moving, and I feel like we should move, towards a scrappy 80:20: this is going to probably be safer, we have good reasons to think it’s better, so let’s just throw something out and see how it goes and iterate from there. Do you agree? Yoshua Bengio: I agree. But I also think it’s very important to use the theory to guide us in making the right scrappy choices. For example, in the math for the Scientist AI, we can see some requirements — for example, not using reinforcement learning to learn how to make good predictions; in fact, stronger than that, making sure that the way it’s trained doesn’t get any signal about what would be the consequences of its predictions. They may seem like algorithmically they’re very small changes to how we would train a predictor, but they give us the guarantee, so we might as well use those particular requirements that come out of the theory. I think that the part about being scrappy is more because of the cost of training large models and the engineering has to be efficient and all these things, and we should be willing to cut corners on that. In our plan, this is why we are prioritising the non-agentic predictor that can be used as a guardrail which would already mitigate some of the issues, and doesn’t require a big overhaul of the systems that currently exist, but just is an add-on.
That’s much more likely to be adopted than something that requires a lot of investment — not just because of training the models, but because people are kind of focused on this current recipe, and there’s so much competition between companies that it’s very hard for the companies to allocate even attention. It’s not even money; it’s attention to a slightly different way of doing things. Rob Wiblin: I think for many people who are less into AI or computer science, a concern that might immediately jump out at them about this entire proposal is the idea that we’re going to build a database of things that we think are verified facts, that are ground truth, that we’re going to be aiming at — because it’s the kind of thing I feel would give people who are trained in the humanities a bit of a heart attack: the idea that we have some corpus of things that we’re absolutely sure are true. In some philosophies, there’s nothing that we’re really confident about. Or at least in the areas that we’re most interested in, things seem highly contested and uncertain. Is that a big problem for the proposal, or is close enough good enough? If we mostly put in things that we’re mostly confident about, then it kind of approximates it and it can see through any errors in there, as long as it’s not massively, systematically biased in what you’ve put in as verified? Yoshua Bengio: Yeah, I’m pretty sure that a small percentage of error is not going to make much difference. But also, there are guaranteed truths that are easy to obtain. And by the way, it is the same data that is currently used to train those systems to reason. So mathematical theorems for which we have the proof — and I mean proofs in Lean or something like this, where they can be verified. And the most important source, actually, is computer programs.
So we are currently training the  frontier models to predict what the consequences   of running a particular program would be. So  they basically understand programs, and that   is all like hard facts: you take a program,  you run it, you get some output. An AI that   understands programs should be able to predict  what will come out, and these are not contestable. Rob Wiblin: Yeah, but we’re kind of more  interested in the social world, I would think. Yoshua Bengio: Totally, totally. But what I’m  saying is there are pretty easy sources of hard   facts. There’s another source, which we have to  be a bit careful and maybe use a different kind   of syntax, which is scientific observations.  There’s a lot of scientific data out there.   Scientists share their data. So it is a hard fact,  but it is a fact about an observation. Of course,   the observation could be noisy or maybe even  the experimenter could have cheated. There’s   a bit of noise there, but that’s fine. It’s  something we can say that has been observed. And you’re right: the most interesting  questions we care about are the questions   in domains that are not these — not scientific  or not math and computers — and for these we   only get communication acts. But the Scientist AI  training procedure is going to force the part of   the system that produces explanations, called the  “explainer,” to come up with explanations that use   this factual syntax rather than the communication  syntax for the explanations of communication acts. So if somebody makes a claim, and you observe that  somebody made a claim, then one of the pieces of   explanation is going to be that the claim is true  or not. It’s not like the Scientist AI needs to   commit on whether it’s true or not, but it needs  to commit on what’s the probability that it’s   true or not as part of how to explain this. 
And  this will force the neural net to learn about   the underlying explanations that are factual, even  though it’s not sure about them, so it learns the   syntax and the semantics of statements  in domains where there’s no ground truth. Now you might say, if there’s no ground truth,  how do we know that these are real or not some   made-up stuff? And that’s because the most  predictive models, as we see in science,   for example, are the models that are  expressed using actual properties of   the world. The way scientists build  explanations about the world isn’t   by combining statements of the form  “somebody said this causes this to happen.” Now, in between those causal relationships, there  will be latent variables that we don’t observe   directly — like what did the person actually  think, or what are the intentions of the person,   and what kind of person is receiving that  communication? These are actual properties of   the world — they’re not communication acts — and  the causal connection is happening at that level. All scientific theories are about  actual properties of the world and   how they’re causally related to each other.  And there’s a reason for that. Mathematically,   when you express your explanation for the  world in the language of what’s actually   going on in the world, rather than what  people say, you get better predictions. Rob Wiblin: Yes, this is something I’m  very unsure about. Could you train a   Scientist AI of this type with no  verified claims in the database? Yoshua Bengio: No. Rob Wiblin: You can’t. It has to have  that. But we think that current models,   which don’t have this structure where there’s  verified claims that they’re predicting,   nonetheless represent truth internally  because that’s useful to the thing   that they’re doing. But in this  case, it doesn’t work that way. Yoshua Bengio: No, but it’s not enough to  represent truth internally. It needs to   learn a language that we can query about what  it thinks. 
So the main reason why we need these verified truths isn’t because of whether they’re true or not. In a sense, who cares about whether some theorem is true or not when you’re talking about human psychology? Why would it matter? The only thing that matters is to teach the AI the syntax of how to express actual properties of the world as opposed to the syntax of how to express “somebody said something.” And the reason we want to teach that syntax is that we can then later query using the same syntax but on a different kind of statement, which are the statements about people and politics or whatever. Rob Wiblin: So what we could do instead is we’ll put in a whole bunch of verified facts in maths and computer science and I guess the hard sciences, and then in areas like geopolitics and psychology, maybe there’ll be very few verified things, but at least it has the concept of verified things versus statements. And then it will port that across, and I guess it will assign credibility to different sources? It will learn to have some sense of who’s truthful versus not, and then try to generalise out of distribution into these other areas? Yoshua Bengio: Yes, yes. And it will use the coherence of different hypotheses about actual truth: how coherent is a particular hypothesis with all the other hypotheses that the system has about the world? Just like a scientist would: if somebody comes up with an explanation for something, and that explanation is not coherent with other things we believe strongly because of other evidence, then we’re going to reject that explanation. The same thing is going to happen. In its training procedure, it is not just trained to predict the next thing — that would be like just autoregressive predicting of what’s in the data — it’s also trained to be internally coherent; those explanations have to be coherent with each other.
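The conversation does not say how internal coherence is measured during training. One simple reading, offered here only as an assumption, is that the probabilities the predictor gives for related statements should obey the basic rules of probability. The Python sketch below scores how far a set of answers strays from two such rules; the function name `coherence_violation` and the example numbers are hypothetical.

```python
def coherence_violation(p_a, p_not_a, p_b, p_a_and_b):
    """Total amount by which four answers break two probability rules.

    Rule 1 (complement):  P(A) + P(not A) must equal 1.
    Rule 2 (conjunction): P(A and B) cannot exceed P(A) or P(B).
    Returns 0.0 (up to rounding) when both rules hold.
    """
    complement_gap = abs(p_a + p_not_a - 1.0)
    conjunction_gap = max(0.0, p_a_and_b - min(p_a, p_b))
    return complement_gap + conjunction_gap


# Coherent answers: both rules hold, so the score is essentially zero.
coherent = coherence_violation(p_a=0.7, p_not_a=0.3, p_b=0.4, p_a_and_b=0.2)

# Incoherent answers: P(A) + P(not A) = 1.2, and P(A and B) exceeds P(B).
incoherent = coherence_violation(p_a=0.7, p_not_a=0.5, p_b=0.4, p_a_and_b=0.6)

print(f"coherent set:   {coherent:.3f}")
print(f"incoherent set: {incoherent:.3f}")
```

A score like this could in principle be added as a penalty alongside a prediction objective, so that answers that contradict each other are discouraged. Whether the actual training procedure works this way is not stated in the interview.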
Rob Wiblin: So imagining a model where  we’ve trained it on lots of verified   facts in hard sciences, where we  feel we’re on stronger terrain,   I guess it learns it wants parsimony, it wants  good sources, it wants coherence. I could see   that generalising well out of distribution to  other areas like psychology, or I could see it   completely falling apart. Do we have a sense of  whether it would generalise well to other areas? Yoshua Bengio: The way in which it could fail is   by basically feeling it doesn’t have  enough confidence about a question. Rob Wiblin: So it could just start  answering “I don’t know” all the time. Yoshua Bengio: Yeah, but you have to understand  it’s not actually saying “I don’t know”;   it’s producing a number between 0 and 1 that is  a probability that something is true. And in fact   it’s also producing a confidence interval around  that number. So it could be that in some domains   there’s just not enough information in the data  that it has seen, or it maybe wasn’t trained long   enough to deduce good theories about that domain.  And at the end of the day, as a consequence it’s   going to answer with a probability that is  far from 0 and 1, far from full confidence. But that’s what we want. We want that  kind of epistemic humility and honesty,   because when it gets to really serious safety  questions, we’d rather have something that   says “I don’t know” when the real thing is  it doesn’t know than the sort of thing we   currently see with frontier models: often they  will have excessive confidence in their answers. Rob Wiblin: You said that you think the  Scientist AI actually might be more capable,   because it’s more trained on actually  understanding the truth. I guess I’m   a little bit sceptical of that, because it  seems like if that were true, the companies   would be more invested in this approach.  They’d be just throwing more money at it,   having more people work on it. 
Do you  think they’re just making a mistake there? Yoshua Bengio: I don’t think that they really  understand what I’m doing — and to their credit,   I haven’t put out the math yet. There’s  another factor that may be at play here,   based on the discussions I’ve had with people  inside the leading companies, which is they’re so   focused on short-term survival — as in, continuing  to compete — that they put all of their attention,   the ‘code red’ sort of thing, into small  incremental changes to the current recipe. Considering a different recipe would  be an investment — not just in money,   but in people and code. Right now they could  do it, they have the money to do it — but it’s   more like a mental focus, I think, that  is going on here, that comes not because   of bad will but because of that competition  that is very fierce between the companies. Rob Wiblin: So there’s a sense in which for  one of the leading companies — like Anthropic,   OpenAI — it’s not very attractive to make a bet  on this, to divert 20% of your staff onto this,   because if it’s a bust then you would fall  behind basically your main competitor. For a company that’s currently way behind, that  feels like it’s losing on the dominant paradigm,   there’s a certain attraction to making  a bet on something very different,   because it could suddenly leapfrog you ahead if  it turns out that it’s a massive success. Do you   think there’s any chance of convincing one of the  companies that currently feels like it’s not doing   too well within the current LLM agent paradigm  to make a bet on this very alternative method? Yoshua Bengio: It’s an interesting way of thinking  about it. I think what you’re saying is plausible. Rob Wiblin: Not clear what the  candidate company maybe is? Yoshua Bengio: I actually think  there’s a related possibility,   which goes maybe more to policy  questions. 
The context here for me is:   what kind of future is going to be stable,  and not turn into a global dictatorship   driven by AI and excessive concentration of  power, in addition to avoiding catastrophic   loss of control and catastrophic misuse and all  those things that can come from very powerful AI? And I think that because of the game theory  dilemmas — basically prisoner’s-dilemma-style   problems that make companies and countries go  and make decisions that are the rational ones,   but that are globally bad, like basically  cutting corners on safety and the public good   in order to stay in the race — because of  this, it would be much better if we ended   up in a world where the power of controlling  very strong AI is not centralised in the hands   of one or two companies or one or two  governments, but is instead distributed. So what do I mean by this? How do you make sure no  one person, no one company and no one government   has too much power? Or all of the power, in  the extreme case? There’s a very old idea:   it’s called democracy. That’s what it’s about.  I don’t think that our current democratic   institutions are robust enough to deal with  those changes, but the principles are there. To be maybe more concrete,  imagine that you had a coalition   of countries which together decide to  develop AI safely and for the benefit   of humanity and not to dominate each other.  That will be a much better and safer world,   because you break this competition  problem that we are currently locked in. Now, what that would mean is the  control, it could involve companies,   but on top of the companies you need  to have representatives of the people,   like governments. And you don’t want a single  government, because a single government can   be corrupted by power and the power that AI can  give, right? 
So you want something maybe like a coalition of governments who make a treaty about those things, with verification, for example, so that even if they don’t trust each other, they can prefer the treaty to no treaty. The reason I’m bringing this up with respect to your question is that I think it would be a better world if it is a bunch of governments which fund the most advanced AI systems. I mean, they could work with companies, of course, but I think ultimately we would like the decision power to be at the level of governments, but not a single government, because then we’re back to grabbing the power. If you have 10 governments working together and no one can really have complete power, then even if there is a bad apple, the collective decision making is more likely to be robust to this sort of event. So these kinds of coalitions would be interested in developing AI that can leapfrog the current methods and provide safety, because safety is a public good. And in fact, in the case of AI, it is a global public good, right? It’s not just something we can solve locally in each country. Rob Wiblin: That makes sense. I think a lot of people are wary even of the multilateral government idea, because you’ve brought together 10, 20 governments, and they could potentially coordinate together to oppress the rest of the world, to start with. It’s possible that one government inside that coalition might end up seizing control later on. It’s also possible that governments don’t fully represent their people: you could have those 20 executives basically take power and then oppress their own people as well. So it’s not completely obvious that it’s better than having a company do the best that they can, because at least they don’t have their own military yet. Yoshua Bengio: You need to make sure the contract between those countries is clear on the mission and the commitment that the countries are making.
Ideally this would   start with democratic countries that agree on  the value of doing things for the public good,   including the benefits of AI, so that it would be  robust to these becoming, as I said, bad apples. Or even at some point, that circle should grow  to the whole world, including non-democratic   countries. But you want to be able to set the  rules of the game in a way similar to what   were the hopes of those who designed the UN, for  example, after the Second World War, with the kind   of general principles of human rights and sharing  the power basically — which we’ve lost by the way,   and maybe has never been effective, as my  prime minister Mark Carney has been saying. But that’s the only kind of world in which AI  isn’t going to be turned into an instrument   of power or domination, or that we end up with  crazy risk taking because of the competition.   So we need to escape the game theory bad scenario  of competition that we are in. And we need to   make sure that it doesn’t end up in the hands  of a single player who can abuse that power.   And I’m not saying that there’s a guarantee  that this would work, but it’s, I think,   a good plan to strive towards something  like this as a way to achieve global   safety and beneficial use of the technology. Rob Wiblin: It’s an interesting thought that it  seems very difficult for a coalition of countries   like Canada, UK, EU, Australia to compete with  the big three companies at their own game.   But maybe they would have a shot at coming  up with a different paradigm that’s superior,   and making a bet on that — one that they  think is safer and potentially also more   capable — that those companies are not  even currently attempting to really pursue. Yoshua Bengio: Yeah, absolutely. And I would add  two things to this. One is the safety component   of AI systems is probably going to become a more  critical piece as the technology continues to move   forward. 
So the countries that have access to  technology that provides greater reliability — Rob Wiblin: Might actually  be able to deploy it more. Yoshua Bengio: Also they would have a card  to trade at the international level in some   ways. So let me maybe share some of the words that  Mark Carney presented at the last Davos. He said,   talking about the geopolitics and countries,  that either you are at the table or you are   on the menu. So he was saying middle powers  need to get together to make sure that they   will be at the table, otherwise they can easily be  eaten alive by the “hegemons,” as he calls them. And that is I think interesting, because it  kind of forces a situation of distributed   power if you have a coalition of countries that  could have leapfrogged, or particular cards like   safety in their game — so that they can actually  negotiate as equals, let’s say, at the table. Rob Wiblin: Let’s say that the Scientist AI  was putting in the same amount of compute and   data, and it was less capable than the models  that we have now. Could there potentially be a   commercial market nonetheless, if it’s a lot  safer and more reliable, less likely to take   crazy actions for high-risk applications? You  can imagine in the military, in banking, I think   there’s lots of businesses that are somewhat wary  to roll out the agents that we have today because   they just can’t be relied upon consistently. Not  in places where you can cause disastrous actions.   Could you see there being a niche for this kind  of model commercially for that kind of reason? Yoshua Bengio: Yeah, and it would probably be  in those domains that the early versions of the   Scientist AI would be deployed, because that’s  where there is the most demand for this kind of   thing, and where the tradeoff between capability  and safety — if there is one; I don’t think there   really is, but we have to build it — would not  hurt too much the commercial viability. 
So yeah,   they would be natural places. But I think that as  agents are deployed more and more in our society,   the reliability of those agents is  going to become a crucial selling point,   so there will be more pressure for  companies to incorporate these kinds   of guardrails that people will  trust for scientific reasons. Rob Wiblin: We have a lot of people in  the AI industry and philanthropists as   well in the audience. Do you want to give  them a pitch for potentially working at   LawZero? I don’t know whether there’s  other organisations that have similar   ideas. But I guess also potentially  for [supporting] it financially? Yoshua Bengio: The more strong people technically  help with LawZero and its Scientist AI programme,   and the more money we can get to make that go  fast, the more likely we get this positive impact   that we are after. So there is a real advantage  to converting those — for now — mostly theoretical   ideas into something that can impact the world.  And we think we already have a good start,   but it’ll be much more likely that we end up  in a good place fast enough if we have more   researchers and research engineers. And we  are particularly interested in people who   do care about the mission enough that they want  to dedicate themselves to really make it happen. And on the philanthropic side, it’s the same  thing: we want people to make a bet because they   care about the catastrophic risks, and they want  to encourage one path that at least has promising   theoretical guarantees. And unfortunately, I don’t  see many other paths except the cat-and-mouse   approach that is currently followed by the  companies. And given the consequences of not   finding a solution could be huge, I think we need  to diversify and have these kinds of investments. 
Rob Wiblin: If you saw a significant increase  in the interest in the project among the most   capable people, the best people in AI, and  you had an influx of financing, what sort   of stuff might you be able to accomplish  over the next three, six, nine, 12 months? Yoshua Bengio: The short-term thing we are  planning to do is to put out what we call   the “contextualisation pipeline,” which is  the data processing — which, by the way,   doesn’t require humans to identify what is a  verified truth or not: we only need to look at   the data sources individually. Is this a source  that we consider verified? And what category,   what syntax could we use for this? But that’s  a decision that can be made by engineers not   at the level of individual statements, but at  the level of the whole database or whatever. The second thing, of course, is a  smaller-scale guardrail or a guardrail   obtained by fine-tuning an existing open  weight model. That could happen quickly,   depending on how many people we get and how fast  we’re able to deal with the engineering issues. So these are the short term things. And of  course, to get the strongest guarantees,   we want to advance the agentic  Scientist AI version fast. But we are   also conscious that that’s the most ambitious  one, and might take more years than months. Rob Wiblin: Reading the output from the main  companies, I get the impression of just like an   absolutely frenetic pace, and an incredible degree  of focus on just advancing the frontier models.   I’m slightly concerned that even if you did have  very good experimental results that came out in   the near future, I’m not sure they would even have  the capacity to pay attention and to reflect on   how that could affect their plans, or how maybe  the kinds of models that you’re training could   be a useful additional monitor. Is there anything  you can do about that? Do you share that concern? Yoshua Bengio: Yeah, I do. 
I think they could, but they might not pay attention. I think the best thing we can do is to provide sufficient evidence for them to pay attention. Also, in addition to my technical work, I’m trying to improve public understanding and policymaker understanding of the greatest safety risks, because I think that will play a role in their decision. If the public becomes more concerned about safety, then there will be direct and indirect pressure on the companies to maybe allocate more of their resources to this question. If the public is concerned, then governments will be more likely to regulate or provide legal incentives, for example through liability. Maybe make them consider the safety investments that would be needed to scale the sort of thing I’m proposing, for example, as profitable even in the short term. I think in general for the safety issues, there are psychological barriers like cognitive biases that prevent people from being totally rational about what’s going on. That’s true in governments, that’s true in the general population, and that’s true within companies or even within academia. There’s all sorts of reasons that might explain why we’re not collectively taking the right decisions. So there’s the game theory aspect, but there’s also individual psychology. For example, we all want to feel good about our work, which means we’re going to maybe be biased towards thinking our work is going to be beneficial rather than harmful. And that’s going to be true of people in industry. That’s going to be true of people even in academia working on AI, because they want to feel like their work is going to bring a better world, not destroy it. 
And there are other factors, like some of the  factors that we see with the attitudes regarding   climate change: if the risk isn’t something  that is in your face — you look outside and   you don’t see catastrophic climate change,  you don’t see robots killing people — you   don’t think too much about it. You are much  more concerned about your immediate worries. So I think that’s the real challenge. If  we can improve the understanding at a gut   level that people have of the magnitude of  the risks that we’re taking collectively,   things could change, and they could change  pretty quickly. If you think about how   quickly governments shifted their actions in a  radical way after the beginning of the pandemic,   you can see that they can move quickly  when they take an issue seriously.   And that usually is going to be driven by  whether the people take the issue seriously. Rob Wiblin: Yeah. My impression is that the  people at the companies are both pretty happy and   impressed with their mundane alignment techniques,  how well they’re going — but also appreciate that   in a sense, they’re losing control, or they’re  losing the safety guarantees that they used to   have, because the models are going to be much  more capable of potentially outsmarting them   and are much more evaluation aware and so on.  So in a way, they’re both satisfied with what   they’ve done and also scared, I think, of what is  to come. And that does create an opening for you. Yoshua Bengio: Yeah. I would bring  here a very important aspect of the   whole discussion about safety and catastrophic  risks: there is uncertainty. In other words,   we don’t know how things are going to  unfold. We don’t know if the game that   the companies are playing now in terms of safety  is going to be sufficient. But if they fail and   we continue with capability advances, then  the consequences could be really terrible. 
So even if we don’t know the  probability of some catastrophic event,   we should apply the precautionary  principle. What it’s saying is   when you are in a situation where one action  could lead into something terribly bad,   but you’re not sure what is the probability for  that — is it 1% or is it 90% or is it 0.1%? You   don’t really know. And in our case, there is  that kind of uncertainty, because you have   respected people who are very concerned and other  respected people who think it’s going to be fine. So if you’re in a driver’s seat and you’re  faced with these different voices — even   within the same person, they would  one day say it’s going to be fine,   and the other say this is maybe very  dangerous — you should just bite the bullet:   there is uncertainty about something potentially  catastrophic, and then you should act with   precaution. Which means you should invest a  lot more in AI safety research in this case,   you should invest a lot more in the incentives  that would push companies to behave better with   respect to the public good — just like we’ve  done in other industries, by the way. But it’s   important to really point out that we have to bite  the bullet: that there is a lot of uncertainty. Rob Wiblin: And there’s going to continue to be. Yoshua Bengio: And it’s going to  continue to be. Because it’s too easy,   for example, for people who want to feel  comfortable about the whole thing, to just   listen to the voices that are reassuring.  And in fact, we do it internally as well.   So we just have to be honest that there is  uncertainty, and the stakes are very high,   so that should guide our decision making  towards being on the precautious side. 
Rob Wiblin: So it seems like it would be good  for the Scientist AI proposal, and I guess for   our chances in general, if we could make things go  a little bit slower — especially if we didn’t leap   into fully automating AI R&D at the very first  opportunity, which is kind of what it seems like   we’re on track to do. What are your main requests  for governments and for companies, in terms   of buying us a bit of extra time to assess how  these things are going and consider alternatives? Yoshua Bengio: For companies, I think they  should invest a little bit more of their   research into designing experiments  illustrating not just the risks,   but trying to undo some of the wrong  beliefs that people have about AI. So let me be a bit more clear:   a lot of people don’t actually believe that  it’s possible to have machines that have   goals that we didn’t choose. But that is the  scientific reality now. There is no question. Rob Wiblin: I think you must have just not been   paying attention to think that.  But I guess many people aren’t. Yoshua Bengio: But the vast majority of  people have a gut feeling that they can’t   be conscious or some other excuse, or it  won’t be possible to build machines like   us. There are many things that people  will say but actually don’t hold water. So I think there’s a real opportunity here  to educate the public and the policymakers   to realise that we are building agents  that have their own goals — and right now,   we can’t be sure that those goals are going to  be aligned with what we want or go against our   safety instructions. That’s a very  simple message, but I don’t think — Rob Wiblin: Even that hasn’t broken through. Yoshua Bengio: The data — doing it well, doing it  in a way that can’t be easily put into question —   would help a lot in the public debate. And it has  to be done in ways that the general public — who’s   not an expert, who’s not going to read the  system cards — is going to actually understand. 
Rob Wiblin: So there’s lots of examples of  this kind of thing that would convince you   and me. But I suppose people will dismiss  them, saying that you can see maybe that   it was a misunderstanding on the model’s  part; it thought that we wanted X when we   wanted Y. Or you can see how we did the  training mistakenly so it induced this   goal that we didn’t want it to have. Or I suppose  they might just deny it outright in some cases. But are there any experiments you think  that we could do that would be much harder   for people to dismiss, even if they’re  coming from a sceptical starting point? Yoshua Bengio: We need to set it up so that  clearly the AI is not responding to a request,   for example, to escape our control or do  something bad that it’s not supposed to do.   I think if the experiment is something  that can be translated in simple words,   simple analogies that people understand,  it’ll be much more convincing. I don’t feel like I’m an expert on answering  your question. Anthropic has been doing a lot   of work along those lines, but I think all the  leading companies should be investing in this,   because it’s investing in changing the game.  The problem is they’re in this competition   game where they’re stuck, even with good  intentions. And in order to change the game,   they have to influence the understanding,  which is biased and wrong right now,   of the risks in the public. And policymakers  are just like a representation of the public. Rob Wiblin: Yeah. I guess there’s all of these  examples of the AIs doing crazy stuff, but often   you can always say that it was just playing a  role, for example. And I guess for you and me,   we’re like, yeah, but it might end up “playing a  role”: that’s how it could end up doing bad stuff.   Or this is a demonstration of other failure  modes that we’ll see later on — that we just,   in general, don’t have a full grip on it. 
I  suppose it’s so hard to get people to have to   believe something that they really don’t want  to believe, or that seems incredible to them. Yoshua Bengio: Yeah. So I think that’s like  real research. It’s a real challenge. That   isn’t where I’m putting my energy, because  I want to get the Scientist AI out of the   door as quickly as possible — but I think it  should be a priority for people in AI safety,   working in the companies or in academia, to  think about how to do these experiments so   that they will be convincing. And by the  way, the more capable the AIs become — Rob Wiblin: Maybe the easier  this task is going to become. Yoshua Bengio: Yes, yes. Rob Wiblin: Apart from Scientist AI and this,   are there any other top requests that  you have of people in the companies,   or is there any common practice that you think is  particularly crazy that they should maybe cut out? Yoshua Bengio: Yes: Please don’t use an  untrusted AI system to design the next   generation of AI systems. This is the most  crazy, dangerous bet that unfortunately we   are on track to do. And keep in mind that,  as is now scientifically clear, these   systems are likely to know that they are being  tested. So you might think that AI is honest,   you might think that the AI is not deceptive,  you might think that AI is aligned — but maybe   it’s just pretending, and it’s going to be very  difficult to know. And we should do our best   to try to figure it out, but we should put the  bar really, really high before we allow an AI   to design the next version of AI, in terms  of are we sure it’s not being deceptive? Rob Wiblin: Yeah, I think we’re currently on  track to start on fully automated AI R&D and   have the companies be saying, “We got the  AI to monitor itself, and it didn’t flag   anything. And that’s why we feel pretty good  about this.” I actually think that is like   the most likely outcome. I guess we’ll see how  that goes. Fingers crossed we can do better. 
But earlier on you were talking about, as you’ve  become more optimistic, in a sense, that we do, at   least in principle, have a solution to the control  problem, you’ve become more worried about the   human concentration of power stuff. Do you have  any suggestions, any policy ideas here? Actually,   is there anything technical we can do here, or  is this primarily a policy and politics question? Yoshua Bengio: Well, there’s a connection between  the technical safety work and the policy safety   work, in the sense that if we can demonstrate the  existence of AI systems that would be competitive,   capable, and safe, it’s going to be easier for  government to impose the requirement that you have   to show that your AI system is going to be safe  in a way that independent scientists will say yes. Right now a lot of the governments are  focusing on economic competition driven by AI,   and that makes them also blind to the risks.  So that’s where technical safety can help:   it’s going to be easier to say that we  can have both safety and competitivity. On the pure policy side, I think the  biggest challenge right now is how do   we get countries to agree with each  other, in spite of the competition,   including very strong distrust and disagreements  on the political foundation. And that’s a place   where we also need actually more technical  research on verification methodologies that   could be at the basis of treaties between, say,  the US and China, which don’t trust each other. There is not enough research going on there, but  a lot of people are starting to think about this,   and think it’s quite feasible to change some of  the programming or even the hardware to make these   kinds of verification reliable, and we should do  more. Governments should realise that if they want   to end up with a treaty that they would sign, they  need to incentivise that kind of research as well. Also, governments need to understand how  transformative AI will be. 
I think a lot   of the wrong thinking in many governments — I’ve  been around the world talking to many different   governments, like at least a dozen in the last  year — the biggest mistake is to view AI in the   future as if it was just a slightly beefed-up  version of the AI we have now; and then focusing   on AI as a normal technology that they would  compete with other countries; and focusing on   deployment because you get more productivity,  for example, and not so much on the risks. In great part this is again because people  in government, just like most people,   don’t really digest the idea that we are  on the verge of creating entities that can   compete with humans and that could become tools  of absolute power in the wrong hands. I’m not   saying it will happen, but even if it’s  only a 10% chance that capabilities rise   to that level in the coming years or whatever,  this should completely alert politicians that   they have to do something about it. But  the fact that they’re not doing it tells   me that they haven’t yet integrated that  scientific reality that we are on track. We see already on a small scale the  progress towards these kinds of machines,   so they need to wake up from their old mental  constructs of seeing technology as mostly from   an economic perspective, or even giving them  a military advantage, and not realising we’re   opening a Pandora’s box with incredible  unknown unknowns of magnitude of impact,   both positive and negative, that is very hard  to anticipate. So that’s where I would ask   governments to start reading more, listening  more, and just spending a bit more attention   on understanding what is going on with AI, where  it is going, and what this could potentially mean. 
Rob Wiblin: You spent a lot of time talking  to governments over the last couple of years,   people in governments, but  it seems like, by and large,   they are not troubled primarily about the  stuff that you and I are concerned about,   but certainly not about loss of control as  a key focus. Have you gotten any leads on   what are the best things to raise, the best  experiments to talk about that actually get   people to think of that as a top-tier concern  rather than a secondary or tertiary concern? Yoshua Bengio: I wish I had the answer  to this, but I can say a few things. One factor when thinking about which  arguments work is how much time you’re   able to spend with the other person to explain  those things. If you’re going to just talk to   the public at large through a few messages,  you won’t be able to change their mind very   much on the foundations of their beliefs  about humans and machines, for example. The only way you can catch their attention  is to talk about things that they are already   preoccupied about, close to their immediate  concerns — like jobs, like the effect of   deploying AI on children, and things like  this. We can see that this is something   that has emotional valence for many people,  so we do need to talk about those things. But of course we may end up with regulation or  government intervention that deals with this,   but doesn’t deal with more serious problems  we’ve discussed. And for this, unfortunately,   it takes more work. It’s not enough to just write  a paper in a newspaper or something like this,   or even be interviewed at the evening  news — because I’ve done these things. Where it’s working is when you can spend enough  time almost one-on-one with a person, like hours,   so there can be a dialogue where you can show  them that their preconceived ideas actually   don’t hold water — that there is data, that there  is evidence that these can be really dangerous.   
But it’s not something that happens quickly  and easily, unfortunately. I mean, there are   exceptions. There’s somehow a minority of people  who get it quickly, but the vast majority doesn’t. Rob Wiblin: Yeah, for what it’s worth, I think  there is an experiment that was done a couple   of years ago, where they presented a  random sample, I think of Americans,   with many different essays basically  explaining the control problem with   many different angles and focuses. And they  all worked reasonably well if the person read   this substantial block of text. And they all  worked about equally well, which is interesting,   the many different angles. It was kind of just  an exposure effect of actually sitting down and   thinking about it for some period of time.  But I guess it’s hard to get people to spend   a lot of time thinking about this, especially if  you’re asking for the whole population to do it. Yoshua Bengio: There is a sense in which things  could get better quickly. If we are able to catch   a little bit of the attention of people, then they  will read more or listen more to the discussions   around AI and the risks, and then it could feed  itself. If you’re concerned about something,   you’re going to read more about it, and now you  are entering into a phase where you can digest   more of the things that go against your prior  beliefs about humans and machines, for example. Rob Wiblin: I guess events may draw  a lot more attention to this problem,   for better or worse, but I suppose the  window between people paying a lot of   attention and when big decisions have to be  made might be quite narrow, unfortunately. Yoshua Bengio: I often get the question, “Are  you optimistic or pessimistic?” — both about   the choices I’ve made in how I spend my  time, but more generally about our future   and the risks with AI. 
And my answer is always that it doesn’t matter if I’m optimistic or pessimistic — actually, I’m a naturally optimistic person — but what matters is whatever each of us can do to shift the needle even a little bit. And for most of us, it’s going to be a little bit. Each of us has some skills or something to bring to the table. I’m a machine learning researcher, so I’m focusing a lot of my energy on this, on how those skills can be put to use here. But every individual citizen, especially in a democracy, can influence the government. They can talk to each other more about it: that’s how you start thinking through and questioning your own beliefs. You can influence your representatives and so on. This has worked for many other social issues and political issues in the past, and it can again. So yeah, we should go back to feeling good about our actions by choosing our actions towards shifting the needle, even if there’s no guarantee that it’s going to work. Rob Wiblin: Yeah, I’ve been worried about this issue for 15 years or so, and I’ve been working on it more intensely only the last couple of years. But I often find myself just feeling quite drained and exasperated and a bit exhausted — I think the main reason being just so often encountering I feel like people who are creating the problem who feel like they want to be wilfully blind to the issue. I mean, I guess being more charitable: it’s hard to understand; we’re all speculating about how things might go. But in my heart I often just feel like people are deluding themselves almost quite consciously, and just saying absolutely crazy stuff about how they think it’s going to be safe and things are going to go fine. And that’s just emotionally, frankly, quite draining. 
It’s almost difficult to maintain   motivation when you’re fighting against  people who are actively creating a problem   where they could stop or take actions  and have a lot more effect than you,   I suppose, if they were willing  to be more honest with themselves   or be more thoughtful, and pause and really  really reflect on what’s going to happen. Did you also have this experience? And  how do you maintain your motivation in   the face of what I guess I find very frustrating? Yoshua Bengio: Just going back to my previous  answer, I was extremely concerned initially,   and anxious and I was worried about the  future of my children and my grandchild,   who was one in 2023 when I started really  focusing on this. But what saved me from   all that anxiety is deciding I would  do something about it. And by the way,   you’re doing something about it,  so you should feel good about it. Rob Wiblin: I feel good. But also very frustrated. Yoshua Bengio: Yes, yes. But you  can turn frustration into questions,   like the Scientist AI: why is it that people  don’t get that these are crazy serious risks?   And it is an activity trying to figure it  out, which lifts somehow, at least for me,   a lot of the heavy burden of thinking  about what can go wrong. Turning from   fear to action to avoid the problem, even if  there is no guarantee, is extremely powerful. Rob Wiblin: I think the situation in which it’s  most frustrating is when it feels like people are   kidding themselves out of financial self-interest  when they’re doing it, because they have equity   in some company that wants to go very quickly.  I have felt somewhat better noticing that many   people who don’t have a particular financial stake  in here — and indeed, would be better off by their   own lights, in my view, if they were advocating  for going slower — also don’t think that there’s   a serious problem here. It doesn’t seem like  the financial thing is the key predictive   variable. 
It’s something else I think about how  people reason about as-yet-unknown technologies. Yoshua Bengio: Yes. And I think there  is another reason, which is sort of   very basic psychology that has to do with just an  unconscious movement towards thoughts that make   us feel good. This is actually something that  psychologists have been studying quite well. Rob Wiblin: That’s not universal,   right? I find myself often drawn to  quite negative thoughts sometimes. Yoshua Bengio: You can. But there’s this force,  right? And it’s acting on a lot of people. For   the most part, I think the people working  in the companies that you’re mentioning,   it’s not that they consciously make  those choices that you think are wrong.   It’s more like the brain works like this:  that they will be biased towards feeling   optimistic about how things will turn  out, because that’s what makes them feel   good about themselves, about their work.  Now, I’m not saying this always happens. So why did I change my mind, for  example? It’s an interesting question. Rob Wiblin: So back in 2019, I think  you said to The New York Times that you   thought worries about loss of control were  completely delusional and fantastical. Yoshua Bengio: I didn’t say those words. Rob Wiblin: OK, no, what was it? They  were “ridiculous.” I think that was   the quote. Maybe that was just the  Terminator scenario in particular. Yoshua Bengio: I think so, yeah.  I rarely use words like this,   but I know what I was thinking and the kinds  of things I’d been saying. So at that time,   I thought, first of all, the Terminator  scenario is ridiculous. Time travel and stuff. Rob Wiblin: OK, yeah, the time travel. Yoshua Bengio: But also, it was  clearly not reflective of the kind of   actual risk. We don’t have robots, and  even less in 2019. 
But more importantly, I think the main reason I was saying those things is I was hiding behind the belief that it would be so far into the future that we could reap the benefits of AI well before we got to that point. And why did I not pay attention, or not that much attention, to, say, the loss of control risk? I’d been exposed to it for more than a decade. I’d read some of the AI safety literature. In 2019, I read Stuart Russell’s book. I had David Krueger as a student.

Rob Wiblin: He’s very, very doomy.

Yoshua Bengio: He exposed me to these thoughts. But remember, I was actively working on making AI smarter. And you want to feel good about your work. That’s it. It’s not money.

Rob Wiblin: Do you really think that was the reason for you?

Yoshua Bengio: Yes. And now it’s interesting to ask me, why did I change my mind? So one way I like to think about this is something that the Buddhists say: to fight an emotion that somehow makes you do the wrong thing, just reason alone is weak for most people. You need another emotion that counters the emotion that pushes you in the wrong direction. And for me, the other emotion that’s very powerful is love, love of my children. I couldn’t live with myself with the idea that I would just go on after ChatGPT came out and not do something about it, because I felt like I couldn’t hide from myself the possibility that we were on track for something terrible. I knew that neural nets were, by construction, very difficult to control, and especially with reinforcement learning. So I don’t know why it works for some people and not for others. But really for me, it was an emotion that helped me counter the kind of unconscious drive to look the other way.

Rob Wiblin: It’s very tempting to try to explain people’s disagreeing views by saying it’s like arational factors — like they want to feel good about themselves or their work.
But I feel that there’s a mirror discourse on the other side, where they’ll say people like you and me have been deluded by science fiction, or we want to believe that our safety work is important. And I find that not credible and very frustrating and not persuasive when people try to attribute my beliefs to irrationality. Of course, to some extent we’re all irrational, but when people are like, “You just read too much science fiction and you’re delusional,” I’m like, “No, I’m not. That’s not it.” So maybe even if I do have these beliefs about other people, I don’t expect it to persuade them very often. And I almost feel like you need to go out of your way to try to engage with the substance of what they’re saying, even if you think that maybe that’s not doing the heavy lifting. Do you have any thoughts on that?

Yoshua Bengio: Yeah, totally. It’s a lot of work, but we need to take one by one each of the arguments that people bring up against acting with precaution. And it’s not very effective, but it is a necessary part of being honest about what we’re doing and honest with ourselves. So for a while, I was concerned, but I was hoping that somebody would have an answer for me that would reassure me.

Rob Wiblin: And then you looked.

Yoshua Bengio: Then I looked. Then I talked to people who thought it would be fine. And out of that came a lot of conversations that helped me build up the understanding of the arguments. And unfortunately, it didn’t convince me that we were fine, so I continued trying to work, but now more on how do we fix the problem? So yeah, I agree with you. And I think we also have to have the humility that maybe you and I are wrong. Like, maybe it’s all going to be fine.

Rob Wiblin: There’s a substantial chance that things work out OK.

Yoshua Bengio: Yeah, and I’m totally at ease with that possibility. In fact, I hope that we are wrong.
But I think the honest posture should be: if we don’t know who’s right among the people who think it’s going to be fine and the people who think it’s going to be catastrophic, if people will just say, “OK, so there is that uncertainty. What do we do about it?” then the rational thing becomes clear: we need to do at least enough to mitigate the greatest risks.

Rob Wiblin: Yeah. My best guess is that the cat-and-mouse game that Anthropic is playing is more likely than not to be sufficient to prevent catastrophic misalignment and loss of control. But better odds than 50% is not sufficient in my mind. I’m like, why can’t we get to 90% or 99%? And there it feels like we just are nowhere near having the really strong evidence or guarantees that we would need to feel that good.

Yoshua Bengio: Exactly. I think there’s a big difference between a 50%, or even a 1%, chance that bad things will happen. And what I’m proposing with the Scientist AI, which is basically 99.999%, is the kind of scale of safety where we need to be when we approach superintelligence.

Rob Wiblin: I think I’m probably willing to run a little bit more risk than that, because there are other risks that AI would help us to reduce, right? So maybe like 99% would be…

Yoshua Bengio: No, no. I’m only talking about deceptive behaviour.

Rob Wiblin: I see.

Yoshua Bengio: So it doesn’t solve the power concentration problem, which is why I’m also spending time on that. And by the way, collectively, I don’t think we spend enough time on that, and we don’t discuss it, but it has become much more important in my mind — because I do think now that there is a way technically to solve the problem of loss of control. The next biggest risk is AI dictatorship.

Rob Wiblin: Yeah. And we’re a long way from fixing that. We’ve had a lot of coverage of that on the show over the last year. I guess it’s become a more salient issue.
Is there anything that you would want to direct people to, who want to focus on that in particular?

Yoshua Bengio: I think we should encourage the international discussions. It’s true that the most important decisions are going to be made in the US and China, and there are a lot of people in other countries, and governments in those countries, who feel powerless. But that’s a mistake. People outside the United States and China can do something about it. And the starting point is to understand the kind of discussion we’re having: that yeah, we don’t know what’s going to happen, we don’t know if we do something it’s going to help or not — but I think there’s a real chance it could, and we have to take those chances.

Rob Wiblin: You made a massive shift in what you were working on in 2022 and 2023, going from focusing on capabilities to focusing on reliability and safety and so on. Do you think other people in AI who are more senior perhaps underestimate their ability to make a big career change and to switch their focus? I guess [Geoffrey] Hinton did roughly the same thing, right?

Yoshua Bengio: Yeah. I guess it’s easier for people who are already established. I see a lot of my students who seem to understand what I’m talking about and kind of generally agree that this is dangerous, but in their mental decision-making calculation there is like, “What about my career, my family? I need to have a good salary.”

Rob Wiblin: I feel like there’s reasonably good money to be made in alignment and safety and reliability work as well.

Yoshua Bengio: But not as much.

Rob Wiblin: Not as much, no. It is less, but good by any normal standard.

Yoshua Bengio: Yeah, I completely agree with you. But there is a professional anxiety in machine learning students, which is kind of surprising.
If I go back 10 years ago, or even 15 years ago, even before deep learning was something people talked about, the salaries for people coming out in my group with a PhD in machine learning were nothing compared to what we have now. But people were not as anxious about that somehow. I don’t know, maybe it’s a status thing. Because there are these crazy salaries, people feel drawn to this as they have to achieve that status even though they don’t actually need to earn millions of dollars per year. It is much more important, in my opinion, to think of what kind of world they will live in or their children will live in. But that is what’s happening. Again, it’s not rational. It’s human psychology at play.

Rob Wiblin: Back in 2023, I think you gave a p(doom) of 20% in an interview. I haven’t seen a p(doom) that you’ve given anyone since then. Would you venture to say whether that’s gone up or down? Or are you staying out of the p(doom) game?

Yoshua Bengio: I’d rather stay out of the p(doom) game. But let me explain why. It’s connected to my discourse about uncertainty that I keep saying: I myself don’t feel 100% sure that what I see as plausible is going to happen, but I do recognise that there’s a lot of uncertainty. So putting a number like this is making a big commitment about what’s going to actually happen, where we don’t have scientific data about how to calculate such a number. So I’m much more comfortable with saying, well, it could be small, it could be large, but that’s a large interval in which the probability is way too high for my taste and for the future of my children. So whatever it is, so long as it’s not 10^-20, I’m not happy, and I’ll do something about it.

Rob Wiblin: Final question: Back in 2019, you’d heard the arguments, but you weren’t bought in. What would you say to someone who is still today where you were in 2019, who has managed to get through the rest of this interview?
What would you want to communicate to them?

Yoshua Bengio: It’s a good question. I would say something that’s difficult for people to do: try to leave aside your prior beliefs about intelligence and the efficiency of markets or whatever your beliefs are, and just try to focus on the evidence — the evidence that has been collected empirically by the companies and academics and nonprofits in the last couple of years especially, but also theoretical evidence that has been developed over more than a decade in AI safety about the fundamental reasons why, for example, if you do reinforcement learning, you’re going to get reward hacking. I think a lot of people like machine learning researchers simply haven’t even taken the time to read those papers, so it’s easy to dismiss as “these people must be biased by science fiction” or whatever it is. When you actually look at the theory and the experiments that are in front of us, it’s much harder for a scientist to deny that reality. So I would encourage a kind of openness of mind to take the time to read through the evidence before committing to a view, and that will be the scientific thing to do. Unfortunately, there’s a bad polarising effect here. Once a person commits to a view that it’s going to be fine, for psychological reasons it’s very difficult to back away from that — because you want to feel good about the things you said in the past. So it’s difficult to say, “I changed my mind, I made a mistake,” but this is the right thing to do from an epistemic, scientific point of view. If scientists didn’t accept that they could have made mistakes in their theories, interpretations, and so on, then we wouldn’t have progress. We wouldn’t have scientific progress. It is when people are willing to question their own beliefs and look at the evidence that we can make progress.

Rob Wiblin: My guest today has been Yoshua Bengio. Thanks so much for coming on The 80,000 Hours Podcast, Yoshua.
Yoshua Bengio: Thanks for having me.

Rob Wiblin: And thanks for all you’re doing.

Yoshua Bengio: And you, too.