Transcript

Metacognitive Intelligence in Human-AI Teams

TITLE: Metacognitive Intelligence in Human-AI Teams
CHANNEL: Santa Fe Institute
DATE: 2026-05-01
---TRANSCRIPT---
Thank you for having me. I appreciate it. I appreciate you also taking the time out of your lunch, filling yourselves up with food, and then immediately sitting down in a dark room. That’s impressive to me. I made sure not to eat very much to make sure I could at least stay awake for this whole thing. I’m going to talk to you today about research that is very new to me. I only recently became interested in AI and in the promises of human-AI collaboration. As Melanie mentioned, I’m at the University of Illinois. I bring greetings from beautiful Champaign-Urbana. In doing a little bit of preparation for this visit, I learned that we share some historical heritage. David Pines, who of course, as I understand, was a founding member here, was also on the faculty at Illinois in physics and founded at Illinois the Center for Advanced Study, along with several other people. I’m interested in the evolution of institutions, so I found this interesting. At Illinois, I run the Human Memory and Cognition Lab, which is, the image notwithstanding, populated by humans who assist me on research on a wide variety of topics. Here are some of them. I put these out here so you can see the variety of things we do. Today, I’m going to talk about some of these domains: collective intelligence, metacognition, human-technology interaction. But for those of you who have interests in these other domains, I’m happy to talk about other things as well. I have a lot of interests. I want to be sure to thank the people who did the actual work that I’m going to talk about today. Specifically, Belgin Unal was a grad student in my lab who’s now at the University of Michigan and did the human-human group interaction work I’m going to talk about today. Jingfeng Zhang is a current student in computer science at Illinois who works with me and others.
Will Deng and Jonathan Ukemba are both graduate students in psychology at Illinois. And then my longtime collaborator and also, I think, friend of the institute perhaps in some way, Mark Steyvers at UC Irvine. I’ll talk a little bit about some older research we’ve done. And also, he’s been an important sounding board for a lot of the work I’m going to talk about today. Forgive me, I want to start with a little bit of autobiography, which is how I came to an interest in this. Like a lot of cognitive scientists, I learned about neural networks as an undergrad. I was an undergraduate at Carnegie Mellon. I took a course with Jay McClelland, learned about connectionism back then. I would say neural networks went through a period of growth in the 1980s, and by the time I was in the throes of graduate school in the mid-1990s, they were less impactful. And it was partly because of demonstrations like this one. Forgive me if you can’t see all the details here, but this is an example of an attempt to train a perceptron, a fully connected network, on the identity rule. So I give it 1, you should give me 1 back. I give it 2, you should give 2 back. And it was a provocative and important demonstration that a perceptron couldn’t really learn this. It can’t extrapolate outside its training set. And this seemed to those of us deep in cognitive science to be a real fundamental limitation of these kinds of networks. That is, they could never propositionalize rules. They could never learn abstractions. Whatever they were going to be able to do, it was going to involve interpolation. And it did a good job of that. But I’d say that’s partly why these types of modeling techniques were in the garage of most cognitive scientists through the ’90s and 2000s. And yet, this is the autobiography part. Then in 2022, ChatGPT came along. And I think it’s fair to say if you do not reflect upon the miracle that is this experience, then you are different than I am.
I would say even the most diehard statistical learning people among cognitive scientists would have never believed that you could get a transformer network, which has, of course, at its core the same fundamental architecture as other neural networks, to be able to engage in such sophisticated language simulation just by training it essentially on the statistics of language. That was shocking to me. And just as a reminder, it’s hard. Like, we can’t do this. This was a cute meme I saw. Florida man was arrested for attempting to baptize an alligator in a Waffle House using a pitcher of iced tea. I couldn’t predict a single word that would’ve come up in this sentence. I couldn’t either, but an LLM could, and that’s amazing. And so in 2022, when these networks came out, Mark Steyvers and I were talking and we agreed we have to devote some of our research agenda to the study both of how these networks work, how it is that they do everything they’re doing, and also of how it is that we can work successfully with them, because they are going to be our partners for the foreseeable future. And I apologize, I should have mentioned this earlier. Please feel free to stop me along the way if I speak too quickly or if you have questions. Anything you want to interact with is fine. Okay, now I would say there’s a lot of effort and attention in the field devoted to the question of whether or not LLMs and those types of models exhibit artificial general intelligence. And I would say, as you know, of course, there’s widespread agreement on the topic. So here, Tim Dettmers: AGI will never happen. It can’t happen. There are fundamental bottlenecks in these types of architectures that will lead them to never be generally intelligent. At the same time, not only is AGI imminent, it is already here. We are experiencing it and living through it. These articles are months apart from one another.
And then I guess there’s a third compromise view, which is that maybe AGI just doesn’t really matter that much. I think there’s something telling in the fact that people can have such widely varying opinions about this. And that’s partly because, as Mel and I were talking before, I think this is a somewhat degenerate question, whether or not AGI, whether or not AI is ever going to be generally intelligent enough. There’s no agreement on what it means for a human to be intelligent. So I don’t know how it is that we’re going to hold an AI to that same standard. Yeah, question? I expected you to have a sample point from AGI will happen, but not with this architecture. Yes, I think that’s another fair point. And maybe, and it’s hard to say, in part because the, let’s see, I have another slide here. One of the things that happens with the assessment of intelligence in agents like LLMs is that as soon as we create a way of measuring it, it becomes subject to Goodhart’s Law. If you haven’t heard of Goodhart’s Law, this is one to put in your back pocket. This is a really useful thing to think about. It is the idea that as soon as you find a way of measuring something, it is no longer useful as a way of measuring it, with the hidden step of being that as soon as you have a way of measuring something, it can be gamed. Okay? So in the case of AI, the way it’s gamed is that you teach the agent, you teach the test to the agents, they learn to do well on the test, They pass some standards, some newly discovered, newly developed benchmark for intelligence. But none of that knowledge generalizes because it hasn’t been trained in such a way that the abilities that we think underlie that are actually what are being tested. This is a good example from outside AI of Goodhart’s Law, which has to do with length of stay in hospitals. So of course, if you’re running a hospital, if you’re doing a good job taking care of patients in your hospital, they will get out sooner. That’s sensible. 
That’s a good starting point for you. The problem is, of course, that once length of stay becomes a thing that you get rewarded for, then there are other things that can also influence length of stay. So in this example, of course, if you let patients go soon, too soon, in fact, then they come back to you and you’ve solved the metric problem. You have good statistics with respect to length of stay, but you have sick people who are coming back to you and you’re not solving the real problem, which is with human health in this case. So I guess I take as my own perspective, rather than sitting around trying to figure out whether or not an AI that I’m working with is intelligent or not, I want to understand them the same way that I understand humans. And when the way I understand humans is by running experiments on them and seeing how they do and trying to come up with particularly clever experiments to try and piece together how it is that they’re doing what they’re doing, the limits of what they’re doing. I build computational models to try and simulate what they do so I can work with model systems. And we want to do the same thing when it comes to AI, or at least that’s the approach we’re taking. There are lots of really smart suggestions as to what is currently missing from AI. AI has all kinds of capacities, but it always seems that around the corner it has yet more capacities. And here are some of the current suggestions. I think these are all right to one degree or another. So learning is a big deal, of course, in the context of LLMs. They do— I like to think of their training as school, and then they finish school before they really ever start getting work with. The reinforcement learning that they undergo is kind of finishing school for them. They’re not really learning facts anymore at that point. They’re being trained how to behave appropriately with humans. 
And it’s an important part of the criticism of these types of models that learning stops, because that’s not at all true of humans. And it’s an important part of the human endeavor that we are constantly seeking out environments in which learning is important and changes us. Another fair criticism is that these models lack world knowledge. They don’t exist in a universe that has the same physical and perhaps psychological laws as the one that we live in. And consequently, they’re not being trained by those sorts of constraints in the same way that humans are, and that will ultimately lead them to fall short of what humans can do. Robustness is an important aspect of evaluating AI capacities and something that’s currently missing. They’re being taught abilities without always being evaluated with respect to how well they can generalize those skills or abilities to new situations. I think that’s fair. And this final paper, on which your esteemed colleague Melanie is a co-author, I think this is an important paper on how it is that AI should probably be taught to have a little bit more foresight, what they call wisdom, and what I’m going to call through the remainder of this talk metacognition: the ability to reflect on themselves, and to reflect particularly on their own states of knowledge and their own progression in learning, and to remediate those states through behaviors that they have in a toolbox. So what can an AI that’s metacognitively intelligent do? I’m going to move over here because otherwise it’s too close for me to read. An AI that is metacognitively sophisticated can give you calibrated confidence assessments. It doesn’t assess confidence randomly the way that LLMs currently do. It can do it by specialized mechanisms. It can evaluate explanation plausibility and simplicity, and consequently can provide explanations that are plausible and simple to its human partners. It can ask for clarification when a question is ambiguous. They don’t do that.
You know this. It can delegate among a team if it’s part of a supervisory role over a team of AIs and humans. It can learn the strengths and weaknesses of those partners and delegate parts of tasks to those teams, to members of those teams. It can work with a partner, and it can weigh that partner’s advice appropriately. It can learn about the accuracy of this person or the expertise profile of this person. It can also learn about itself and learn to compromise by evaluating its judgment in light of a partner’s judgment. And that partner could be human or another AI system. It can defer responding until it knows that it will receive information that is relevant to making a decision. It can say, OK, you’ve asked me this, but I don’t want to give you the answer to this until later when these data will have come in that will help us adjudicate this in a sensible way. It can engage in risk assessment. This is a really important one. Right now, an agent that gives you an answer is not fundamentally taking into account the consequences of being wrong, right? So if this is a particularly high-stakes decision, it should demand for itself more information before it provides you with a definitive response. They can’t do this now, and they certainly can’t do it if they don’t have the capacity to reflect on themselves. It can seek additional training and say, okay, I seem to be doing well in this domain, but not in this domain. And it seems to be because I lack this knowledge. I can ask my human partner to provide me with additional training, or I can seek out on my own sources on the internet and things like that that I think will help remediate, smooth off the rough edges of my knowledge. All of this, if we are talking about these abilities, if we were talking about them in the context of humans, we would refer to them, I think, as a degree of intellectual humility. 
That is, they understand something about the limitations of their current cognitive capacities, and they have a set of tools with which they can go about trying to fix those. I want to turn for the first part of this talk to talking about what makes humans work well in groups. And we have here examples of some groups that work well and some that don’t work well. I’ll leave it to the audience to make up their own mind about that. Quiz Show is a particularly interesting example. In Quiz Show, teams from colleges work together to answer questions, and teams are composed of people with varying expertise. And then of course, you could make decisions like, ah, this is our guy that knows about history, and this is our woman that knows about math. And you could call upon the right people for a particular question. Other teams working together lose a lot of information in the process. Other teams attempt to be optimal transmitters of information, and maybe approach that. But the goal of starting by looking at humans is the idea that humans have metacognitive capacities. It’s an important part of how they work successfully in teams. And I wanna unpack for you a little bit about how that works. So the place to start is that in psychology there’s, I would say, maybe even an odd fetishism with the wisdom of the crowd, which is a really important phenomenon: namely, that when you aggregate judgments (usually estimates, but it can be other kinds of things) across lots of individuals, what you find is that somehow within that bank of individuals is held something that is often very close to the truth. The historical example from Galton was guessing the weight of an ox, which hundreds of people did at a county fair. And though no one got it right, the aggregate of all their judgments was extremely accurate. And this is sensible, right? The idea is that error is idiosyncratic, but truth is shared.
And consequently, if we average out all of our idiosyncratic error, then what’s left is the truth. And that’s led to one story about the value of group decision-making. And that story is that, well, of course, they collectively possess more knowledge. If you have four heads next to one another, they possess more information within those four heads than one head does. And consequently, if you can extract it, you can do better with groups. Let’s start with that, but bear in mind that I don’t think that’s gonna be the real story about why groups end up doing better. So I’ll tell you briefly about an experiment from a paper from my group and from Mark Steyvers, a collaborative work. It’s a really simple experiment. I’ll take my time to make sure that everyone understands, ‘cause this is gonna be similar to other experiments I’m going to talk about today. People are answering questions, in this case general information questions that have two alternatives. There’s a clear right answer. And in this version of the experiment, there are two conditions, and each person is assigned to one of them. In the random participation group, of the 100 questions they’re asked, they’re obliged by the experimenter to answer a random 25 of them. They could be on any topic. There’s a wide variety of topics represented on this little quiz. In the opt-in group, the subjects in that condition get to choose which 25 they want to answer. So it should be apparent to you that the people who are in the opt-in group will do better. It’s not a fair comparison, right? They get to choose, and to the degree that they have valid metacognitive knowledge, that is, they know a little bit about what they know and what they don’t know, they should do considerably better. But bear in mind, that’s not the question we’re looking at here. That’s straightforward. The question is, what happens to groups that are composed of those individuals that are self-selecting?
And in a second, I’ll show you why it’s actually not so obvious that those groups are going to outperform the other group. Here’s one way of looking at the data. Post hoc, we’ve plotted the questions on the abscissa here, ordered by how frequently people chose to answer them. So on the left side, you have questions that people most often chose to answer. On the right side, questions that people least frequently chose to answer. Each square that’s filled in represents an answer that was provided. A red square indicates that it was a wrong answer. A green square indicates that it was a correct answer. In the random group, or control group, as we call it here, you see that these squares are randomly distributed, as has to be the case because we forced them in there. You also see that in general, they’re being less accurate on this end than on this end. That is, as a population, they are correctly choosing which questions are easier to answer. And you can see that in the self-directed or opt-in group, they’re really answering mostly easy questions themselves, and they’re leaving the hard ones out. And that’s why it’s not so obvious how this group is going to perform. Correlations in knowledge are a tricky thing to deal with. And in this case, one of the troubles that they invite is that if everybody is choosing to answer the same questions, then that’s not so beneficial for the group behavior. In fact, there’s a bunch of questions out here on the end that aren’t answered at all. And of course, the team that is composed of these individuals is going to get those wrong by definition. There are many more that are only answered by a few people. Okay? So the question about which teams will outperform is a question of the trade-off between metacognitive sophistication, the level of individual knowledge, and the distribution of correlated knowledge across the group.
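To make that trade-off concrete, here is a minimal Python sketch of scoring a crowd by majority vote when each person answers only some of the questions. It is an illustration under stated assumptions, not the analysis from the study: the function name `crowd_accuracy`, the data layout, and the choice to score a tied vote as a coin flip are all hypothetical. A question that nobody answers is counted as wrong, as described above.

```python
from collections import Counter


def crowd_accuracy(responses, answer_key):
    """Majority-vote accuracy of a crowd with sparse participation.

    responses:  one dict per person, mapping question id -> chosen option
                (questions a person skipped are simply absent).
    answer_key: dict mapping question id -> correct option.
    """
    total = 0.0
    for question, correct in answer_key.items():
        votes = Counter(r[question] for r in responses if question in r)
        if not votes:
            continue  # nobody answered: the crowd gets this one wrong
        top = max(votes.values())
        winners = [option for option, n in votes.items() if n == top]
        if correct in winners:
            # Assumption: a tied vote is scored as its expected value.
            total += 1.0 / len(winners)
    return total / len(answer_key)


# Toy opt-in crowd: everyone picks the same two easy questions.
answer_key = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
opt_in = [
    {"q1": "A", "q2": "B"},
    {"q1": "A", "q2": "A"},
    {"q1": "B", "q2": "B"},
]
print(crowd_accuracy(opt_in, answer_key))
```

In this toy crowd the two answered questions are decided correctly, but the two that no one chose cap the crowd at half credit, which is the cost of correlated self-selection.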
Nonetheless, what you see (let me point to the screen) is that by any measure that we traditionally use to assess groups or wisdom-of-the-crowd kinds of benefits in these groups, the crowd that is composed of self-selected individuals (maybe I don’t have this labeled clearly, excuse me) actually outperforms the crowd that is composed of randomly assigned individuals.

So what I don’t like about this, and what I also don’t like about Galton’s thing, is that the method of aggregation is, in this case, majority voting, or in his case, averaging, which I think is the least interesting. I mean, what about deliberation?

Yeah. Oh, I couldn’t be happier that you asked that. Okay. Is that the next slide? It is almost the entirety of the rest of this section of the talk. Yes. Yes, that is an opinion I share: it is interesting and revealing about population-based knowledge that you can aggregate in really simplified ways and get enhanced performance. But that does not imply at all that that is how groups operate in service of making good decisions. What the lesson from this study tells us is simply that there is a role for metacognition in group knowledge. We can see that individuals who think about what they know can be part of a composition of groups that outperform individuals who don’t get that opportunity to self-select their own participation. But as you mentioned, interaction isn’t just that. We’re not just putting a shovel in people’s heads and putting all the information in a pile. Metacognitive exchange in the course of interaction is a critical part of what we think of as regular collaboration. People sit together and they do all kinds of things to exchange metacognitive information. At the simplest level, they might just say, I think this, and here’s how confident I am. And then you say, well, I think this, and here’s how confident I am. That’s a starting point.
It’s an obviously simplified view of how people interact in groups, but it’s a starting point. If you hold on to that idea, we’re going to get at the end of this section to a collection of ways in which people might act in sophisticated ways in groups to exchange metacognitive information. Here’s a starting point for what kinds of things you can do if you do this. You get to learn about your partner’s knowledge and expertise. You hear them say things. They provide explanations. You start judging them. They are judging you. Okay? You get feedback on your own judgments, which you don’t if you work individually or if your information is simply summarized. You develop a common language for communicating uncertainty. This is a really interesting experimental demonstration that I’m not gonna tell you about, but people who work together converge on common definitions for terms that they use to talk about confidence and certainty. And maybe most importantly, they construct joint explanations. We take a little piece of what I said, we take a little piece of what you said, we take a little piece of what he said, and together we construct a novel solution that’s different than what any one of us came up with. And out of that we get enhanced group performance. Okay, so in this experiment, which has a similar task underlying it, we’re gonna compare individuals who are answering two-alternative forced-choice questions and rating their confidence with groups that do that. And I wanna point out that this is a really, really minimal interaction. For each one of these questions, and I can’t remember how many there are, 50 or something like that, they’re interacting on average about 10 to 15 seconds. So this is not a deep, involved, long-term collaboration. This is a really, really short time span where they’re exchanging just a little bit of information.
And I will tell you, because it’s not an important part of what we’re gonna look at here, groups will do better than individuals. That’s not what’s at stake here. What we’re gonna do is we’re going to take the individuals, the ones that were in the individual condition, and we’re gonna simulate, using their behavior, what simple metacognitive strategies would yield in terms of performance. So this is the simplest one and the dumbest one. I call it here no information sharing, though I should probably call it something like metacognitively naive groups or something like that. They exchange no metacognitive information. This is equivalent to the wisdom-of-the-crowd kind of behavior where there’s a pair of these individuals. They’re not really working together, let me remind you of that. But for each question, we simply randomly choose one of their answers. This condition won’t outperform individuals, though its variance will be a little less for reasons that are not all that interesting. But the important thing is we’re going to compare it to a group that simulates a minimal amount of metacognitive exchange. We take that same pair that we’ve artificially put together, and for each question, we simply choose the answer from the person who expressed higher confidence. In this case, this person expressed higher confidence for the question about snakes. This one expressed higher confidence for the question about Budapest. And out of this, we’re gonna get predictions of behavior that we’re then going to compare to actually interacting groups. Okay, so here’s a distribution of scores generated from all the pairwise combinations of individuals in the individual condition. We put them together, we randomly picked one of their answers. The answers aggregate to, I don’t know whatever that is, 67%, but you get a nice smooth distribution because there are lots of these, 52 choose 2 or whatever it is.
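The two simulated ("nominal") groups described here can be sketched in a few lines of Python. This is a hedged illustration rather than the lab's code: the data layout (one list of `(is_correct, confidence)` tuples per person), the function names, and the random tie-break when two people report equal confidence are all assumptions. The last helper computes a probability of superiority, one common way to compare two score distributions, with ties counted as half.

```python
import random
from itertools import combinations


def pair_score(a, b, rule, rng):
    """Proportion correct for one simulated pair under a selection rule.

    a, b: lists of (is_correct, confidence) tuples, one per question.
    """
    n_correct = 0
    for (a_ok, a_conf), (b_ok, b_conf) in zip(a, b):
        if rule == "random" or a_conf == b_conf:
            chosen = rng.choice([a_ok, b_ok])  # no metacognitive information used
        elif rule == "max_confidence":
            chosen = a_ok if a_conf > b_conf else b_ok
        else:
            raise ValueError(f"unknown rule: {rule}")
        n_correct += chosen
    return n_correct / len(a)


def nominal_scores(people, rule, seed=0):
    """Scores for every pairwise combination of individuals."""
    rng = random.Random(seed)
    return [pair_score(a, b, rule, rng) for a, b in combinations(people, 2)]


def probability_of_superiority(xs, ys):
    """Chance that a random draw from xs exceeds one from ys (ties count half)."""
    wins = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    return wins / (len(xs) * len(ys))


# Two hypothetical people who each know a different question.
ann = [(True, 90), (False, 55)]
bob = [(False, 60), (True, 80)]
print(nominal_scores([ann, bob], "max_confidence"))
print(nominal_scores([ann, bob], "random"))
```

With real data, `people` would hold one entry per participant in the individual condition, and the two lists returned by `nominal_scores` would be the distributions being compared.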
Importantly, the maximum confidence simulated nominal group hugely outperforms this. I realize this is not a typical way of showing data, so it’s hard to see how big this effect is. But this translates into a probability of superiority of something on the order of over 80%. If you randomly selected one value from each of these two distributions, the one from the confidence group would be greater 80% of the time. So we can say that, okay, we see that we can improve performance in group behavior simply by selecting on the basis of confidence rather than random selection, which shouldn’t surprise you. But the important thing is that actually interacting groups wildly outperform either of those. Actually interacting groups are doing something beyond simple confidence trading. They’re exchanging something more sophisticated than that. And I promise I will tell you a little more about what I think those things are. Question back there.

What was the incentive structure for the individuals and the groups? Like, what were they instructed? Were they getting the most points? Like, how did they...

Yeah, so no incentives in the form of money or anything like that. They are told that they should try and answer as many questions right as possible. They do get like a score on it, so they’re keeping score along the way. With groups, there’s a concern, though I’m not going to talk about it here today, that you get this diffusion of incentives in the form of social loafing and things like that. But you don’t really get that in groups of size 2, which is why we do this with small groups. When we’ve done this with larger groups, that’s a real problem.

If you kind of switch that, like even if you’re just using points and let’s say you gave them 1 point for correct and -10 points for incorrect, does that change the types of behaviors?

Yeah, it’s a great question. We don’t know. We’ve never done anything like that.
Certainly, there’s lots of experiments like that with individuals, examining how they use their metacognition to regulate responding in that case. How groups coordinate that dynamic risk assessment would be super fascinating to study, but I don’t know anything about it.

Yeah, it’s a little bit hard to see, but as the graph showed, the distribution showed, it seems like the mode between the max confidence and interaction is not that different. But it wraps it up, the bottom part of it. Will you explain that?

We’ve noted that too, but I want to point out a weird quirk of this, which is that these two distributions come from randomly sampling pairs of individuals and resampling that process over and over again. This is actually just 50 people that work together. This distribution is much less well estimated in shape than these distributions. So we’re not at all confident that that’s a real thing, but we’ve wondered about the same thing.

Oh, I just had a question about confidence and the real interaction. So if I know about the confidence level of someone who I don’t know, it’s more difficult to assess than when I interact with this person. So if I say how confident I am about something, that means something else than...

Yeah, it’s a great point. I’m going to talk a little bit about that in the context of human-AI interaction coming up in the next section of the talk. But yes, in order to evaluate confidence, you need to know about their calibration. Some people are uncalibrated and give wildly off confidence ratings. And then it’s hard to know how to weight them. And that’s an important part of what goes into actual interaction: learning about how calibrated they are.

Well, I guess this is almost repetition, but I was going to ask it before and then thought not to. But so what about something like the Dunning-Kruger effect, where people notoriously miscalculate their confidence?

Yeah, I’m thrilled you asked that.
It’s coming up, maybe the next slide, but if not, the slide after that. Yeah, because there’s a really interesting thing that falls out of that. Okay, just to summarize, we see that these brief interactions of about 10 seconds allow you to outperform, by a considerable amount, even a relatively sophisticated interaction simulation in which people are exchanging confidence. All right, here’s that slide. We can also look, for each of these conditions, the individuals, the simulated groups, and the interacting groups, at their calibration. Okay, so what we do is we take their confidence, we bin it into groups, and then we look at the accuracy within those groups of judgments. If they’re well calibrated, they will fall on this line. If they are overconfident, as they often are in Dunning-Kruger-type examples, you’ll see more of the data collected down here. As you probably know, overconfidence is ubiquitous in human judgment in cases where people are making judgments about familiar kinds of things. And it appears here as well. But, oh, I’m sorry, I moved the graph over here. The interesting thing that fell out of this that we really didn’t expect was that groups are the least overconfident and most calibrated of all of the conditions that we’re comparing, right? So these are the no-info-sharing and maximum-confidence groups. Individuals are down here in this range too. I didn’t put them in this function here. Groups, whatever it is that they’re doing in this 10 seconds of interaction, it’s also allowing them to become not calibrated to each other, but calibrated to reality. And so they show less overconfidence.

In this group, people are giving their confidence in the form of like a percentage as opposed to a verbal label?

Yes. Yes. Though when they interact, they’re probably doing it in the form of verbal labels. Yeah.
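The calibration analysis described above (bin the confidence ratings, then compute accuracy within each bin) can be sketched as follows. This is an illustrative Python sketch under assumptions, not the analysis from the talk: the function names, the bin edges, and the 50 to 100 percent confidence scale (natural for two-alternative questions) are hypothetical choices.

```python
def calibration_curve(confidences, correct, edges=(50, 60, 70, 80, 90, 100)):
    """Return (mean confidence, percent correct, n) for each non-empty bin.

    confidences: confidence ratings in percent (assumed 50-100 here).
    correct:     booleans, True when the corresponding answer was right.
    """
    n_bins = len(edges) - 1
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are half-open, except the last one includes its upper edge.
            if edges[i] <= conf < edges[i + 1] or (is_last and conf == edges[-1]):
                bins[i].append((conf, ok))
                break
    curve = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = 100 * sum(ok for _, ok in items) / len(items)
            curve.append((mean_conf, accuracy, len(items)))
    return curve


def overconfidence(curve):
    """Weighted mean of (confidence - accuracy); positive means overconfident."""
    n = sum(k for _, _, k in curve)
    return sum(k * (conf - acc) for conf, acc, k in curve) / n


# Toy data: a judge who is right half the time regardless of confidence.
curve = calibration_curve([55, 55, 95, 95], [True, False, True, False])
print(curve)
print(overconfidence(curve))
```

Points on the diagonal (mean confidence equal to percent correct) indicate good calibration; points below it correspond to the overconfidence pattern described in the talk.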
And my other question was, with a lot of these questions, you would imagine some questions needing 2 seconds of interaction and some needing 60. Like, you know, no mammals, you know, give milk to their young.

Right.

Well, I guess that was under 10 seconds. But that seems like an interesting condition, especially given that some of the questions seem to be markedly harder. You saw some where everybody answered them and they were all wrong.

Yeah, there are misleading questions that weren’t intended to be, but they turn out to be. Yeah. I don’t have much to say about that now, but I will, because what we’re doing right now is a project where we’re following up on this and doing actual natural language processing of the interactions themselves, so that we can try and build text decoding models that tell you what it is about these conversations that then leads to group judgments that outperform the individuals’ answers that they brought with them to those judgments. But for now, it’s hard to know, but there certainly are item-based differences. You’re right about that, though we don’t let them go 60 seconds, so none of them take that long.

I don’t want to slow you down too much, but...

No, sure.

There seems to be a difference between fixed knowledge versus problem solving, where in fact problem solving may depend on having different perspectives coming together. It’s more of a... but it’s a different thing. Now, are these questions mostly knowledge-based, or do they include problem-solving where 2 people would do better than 1, or 10 people would be better than 2?

Yeah, it’s a really good question. I think that it’s not quite as sharp a dividing line, though, as you’re imagining. So you can know something by fixed knowledge. You can know that the only mammals that lay eggs are platypuses and echidnas. Anyone know? Any mammal experts in the crowd? I think that’s right. Okay.
But you might not know that, but you might know that, okay, to be an egg-laying mammal or something like that, you have to have these qualities and live in these places. And then the other person says, okay, but that’s also where platypuses live. And so you actually can work together to recover facts that may or may not be in your own individual head, but that you can derive from first or second principles if you work together. But at the same time, there’s a classic question. Some of the greatest questions, I mean, depend on the diverse feedback. Absolutely. Yes. My question is whether your conclusions, your inferences, would change if you move to that. I think a good example of those types of questions is in forecasting, forecasting geopolitical events or things like that. Yeah. Those are things that I think benefit most from these types of interaction. Yeah. So in some ways we’re understating its potential here. Okay, we’ve been through this. Okay, I’m going to do this quickly. I’m thrilled to have all the questions, but I am going to bypass some of this a little bit just to make sure we have time to get through. This is a replication of this effect with misinformation detection. This is work by an undergraduate. These data literally are being presented, I think, maybe as we speak, at a conference right now. So these are steaming off the press. But she showed that interacting groups outperform individuals in misinformation detection, and also that interacting groups outperform the same kinds of nominal group combinations that I told you about before. There’s also this perfect metacognition benchmark that I didn’t mention, which is the idea that if two people are working together, if either one of them has the correct answer, then we score it as correct.
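To make the two benchmarks concrete, here is a rough sketch of how a nominal dyad could be scored under the “perfect metacognition” OR rule and under a maximum-confidence rule of the kind mentioned earlier. The function names, the tie-breaking choice (ties go to the first partner), and the toy data are assumptions for illustration; the talk does not specify those details.

```python
def perfect_metacognition_accuracy(correct_a, correct_b):
    """OR rule: an item counts as correct if either partner had it right."""
    hits = [a or b for a, b in zip(correct_a, correct_b)]
    return sum(hits) / len(hits)


def max_confidence_accuracy(correct_a, conf_a, correct_b, conf_b):
    """Nominal group: on each item, take the answer of the more confident partner.

    Ties are given to partner A here, which is an arbitrary assumption.
    """
    chosen = []
    for ca, fa, cb, fb in zip(correct_a, conf_a, correct_b, conf_b):
        chosen.append(ca if fa >= fb else cb)
    return sum(chosen) / len(chosen)


if __name__ == "__main__":
    # Hypothetical dyad on four items.
    correct_a = [True, False, True, False]
    correct_b = [False, False, True, True]
    conf_a = [0.6, 0.9, 0.8, 0.7]
    conf_b = [0.8, 0.5, 0.7, 0.6]
    print("OR-rule ceiling:", perfect_metacognition_accuracy(correct_a, correct_b))
    print("max-confidence:", max_confidence_accuracy(correct_a, conf_a, correct_b, conf_b))
```

The OR rule is an upper bound on any rule that simply picks one partner’s answer, so the gap between it and the maximum-confidence score gives a sense of how much is lost when confidence does not track who is actually right.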
If they could completely accurately convey to each other who has better knowledge, then the OR rule gives you a rule for deciding when they have perfect metacognitive knowledge. How do these questions relate to the actual creative process of 2 people coming together and trying to come up with a new theory? Yeah, thinking together for creative purposes, I don’t know. I mean, I... Interesting question. Yeah, it’s super interesting to think about, for example, not how you solve a problem in science, but how you generate a new problem in science. How do you think about a new way of asking these things? And I think it’s going to take studies that allow people to interact for more than 10 to 15 seconds. There are studies of these kinds of things, but it’s going to take different paradigms than what I’m using here. Okay, here’s a slide where, I promise, I’m not going to review any of the evidence that comes from our work on this, but I’ll tell you a little bit about some of the things that I think groups can do, even in these brief interactions, that allow them to outperform these nominal groups. One is that not only do they evaluate the answers that people provide, they evaluate the quality of the explanation that they provide. And it’s noteworthy that this parallels an important recent-ish development in AI, where explainable AI techniques, tools that allow AI to explain its answer, also lead to increased trust from their human partners and increased use by their human partners. People want to hear an explanation. They want to be able to evaluate that explanation and judge you for it. And group interaction allows you to do this. Similarly, problems, even simple problems like the ones we’re dealing with here, often have a multidimensional component to them, where I might know part of an answer and you might know part of an answer, and we put that together. We learn, as I mentioned before, by calibrating to one another.
We learn that, oh, okay, this person expresses certainty a lot, and maybe I shouldn’t take that as evidence that this is highly accurate, for example. And of course, groups that work together can flexibly employ different decision strategies depending on the problem. So it turns out that optimal decision strategies depend a lot on whether the partners in a collaboration are close to one another in ability or widely different from one another in ability. And this can differ on a problem-by-problem basis. And groups allow themselves the opportunity to opt in and out of different strategies for solving those kinds of problems. Okay, back to AI. So we talked a little about humans. I’ve given you some thoughts about why it is that humans interacting in groups might outperform individuals. And now let’s think about humans interacting with AI. People seem to agree that we are headed for a future in which a lot of important decisions are made by human-AI teams, human-machine teams, hopefully with a human still in the loop, from my own perspective. But I think maybe the best quote on this comes from Garry Kasparov, of all people: “True collaboration is not about dividing the work between machines and people, but about bringing the strengths of both together to solve problems and achieve more than either could alone.” I think that summarizes what a lot of smart people have said about human-AI interaction, including Terence Tao and people like that. But that’s the idea: for human-AI agent collaboration to work, we have to be thinking about the relative strengths of both, just like a partnership between two humans. Here’s a recent paper, a really recent paper. I think it was just from a couple of weeks ago, on evaluating AI agents in the context of medical decision-making. And they did a really good job. They have a bunch of interesting questions they’re asking of LLMs. These are standard LLMs.
They are not bespoke medical-oriented ones like Open Evidence and things like that. Here they are. And they’re evaluating each one of these agents on the quality of their answers to both closed-ended questions and open-ended questions. And I don’t want to talk at all about the results. It knows some things and not other things, just like everything else. What I want to point out to you is this, and this is buried in a couple of quotes in the paper: “Cross-model outputs were consistently expressed with confidence and certainty, with a low prevalence of caveats and disclaimers, and rare refusals to respond even when answers were contentious or incorrect.” There’s nothing about LLMs that is designed to be cautious in answering your questions. There’s everything about them that’s designed to give you an answer that makes you happy and keeps you on the product. Okay? Consequently, for 250 total questions across 5 different AIs, actually, I think that is across the 5, there were only 2 refusals to answer, both by Meta AI of all things, which is interesting. That is, they are not designed in such a way as to think, prior to giving you an answer, about whether or not it’s going to be beneficial to you and the agent as a team to have that answer. And that’s a real concern. And so I want to turn in this part of the talk... oh, I’m sorry, I have one more slide I forgot about, another important example of the claim that overconfidence, and miscalibrated confidence more generally, is a feature of LLMs that will not be overcome. This is from a blog post or a Substack post by Helen Toner. What I want to ask is the question of how we take simple neural networks and make them metacognitively competent. And there are simple solutions to this problem. And then there are novel solutions to that problem. And I’m gonna tell you about them all at once. We’re gonna do this in the context of two different architectures.
And I don’t assume that you are all familiar with standard neural network architectures, but I will only give you the briefest of introductions to them. This is a convolutional neural network, which is used prominently in computer vision examples. It’s taking images on the front end; on the back end, the fully connected layers then give you logit outputs for the different categories that it’s learned to classify. In our case, we’re gonna train this simple neural network on what’s called CIFAR-10, which is 10 different categories of things it learns to classify. That’s our simple case. And then we’re gonna take a complex case, ResNet-18, which is a variant on a convolutional neural network that has the additional capacity of pushing through to further layers both the input that came into that layer and the transformed output. If that’s not interesting or not meaningful to you, it’s not really going to matter. We’re using these two examples as simple test beds for trying out different ways of extracting confidence from neural networks. Oh, and I forgot to mention, this one is being trained on 100 object classifications. So this is a hierarchical set of things that it can learn to classify. Again, all these cases are visual classification. Dogs are one of the categories in both of these cases. The most common thing you see when people talk about extracting confidence from neural networks is simply using a softmax transform at the output layer. The output layer, like I say, is in logits. You run that through a softmax transform, which turns them into probabilities. And then you say, well, probabilities look like confidence judgments when they’re on a 0 to 1 scale. So we’ll call that the confidence rating. That’s one of the things we’re gonna evaluate in the test bed that I’m about to introduce you to. These other terms come from similar measures.
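As a rough illustration of the softmax readout just described, the sketch below turns a vector of logits into probabilities and reports the largest one as the confidence rating. It is a generic, standard-library version written for this transcript; the function names and the example logits are assumptions, and nothing here is taken from the speaker’s test bed.

```python
import math


def softmax(logits):
    """Convert raw output-layer logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_confidence(logits):
    """Return (predicted class index, probability assigned to that class)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


if __name__ == "__main__":
    # Hypothetical logits for three classes, e.g. dog, cat, truck.
    label, confidence = softmax_confidence([2.0, 1.0, 0.1])
    print(f"predicted class {label} with confidence {confidence:.3f}")
```

The number that comes out looks like a confidence judgment because it lives on a 0 to 1 scale, but nothing in the transform itself guarantees that it matches accuracy.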
Negative entropy applies more broadly across an entire set of outputs rather than just a single output. Temperature scaling is a means of calibrating that post hoc so that confidence can be tuned to accuracy. There are also techniques that rely on understanding the representation of objects at the hidden layers in the network. So we can use, for example, a high-dimensional representation of a queried object and evaluate how close it is to other objects within one category or another, and use that distance as a measure of confidence. Or we can ask how many objects from a similar category it’s near, in k-nearest neighbors, and use that overlap as a measure of confidence. That’s another way of extracting confidence. What I wanna spend more of the time on is what I’m calling here consensus methods. And I apologize, I haven’t really settled on a term for this. So what I’m gonna do is change that term every couple of slides. But these are methods where the idea is we are going to resubmit a query over and over again to the agent, and we’re gonna look at the consistency of its responding as a basis for evaluating its confidence. One way we can do that is by image blurring. We can take the picture of the dog, and we can put noise on it, different noise 50 different times, and then look at the coherence of its response over those 50 queries. If it says dog every time, it should be highly confident that’s a dog. If it says dog only slightly more than half the time, then it should be less confident. Similarly, Monte Carlo dropout is the same idea, but rather than applying it to the input, we’re gonna apply it to one of the actual hidden layers of the network. Okay. So each time we run the picture of the dog through, we’re doing it with a handful of those hidden units dropping out of the process. That’s been done. Well, we’ll talk about that in a minute. Okay. So here, actually, I think I’ve talked about probably everything that’s on the next slide or two.
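The resubmit-with-noise idea can be sketched in a few lines. The example below is only a toy under stated assumptions: the “network” is a stand-in nearest-prototype classifier on two made-up features, the noise is Gaussian with an arbitrary standard deviation, and the names `consensus_confidence`, `nearest_prototype`, and `PROTOTYPES` are hypothetical. Monte Carlo dropout would apply the same voting scheme while randomly masking hidden units instead of perturbing the input.

```python
import random
from collections import Counter

# Hypothetical two-class "network": each class is a point in feature space.
PROTOTYPES = {"dog": [1.0, 1.0], "cat": [-1.0, -1.0]}


def nearest_prototype(features):
    """Stand-in classifier: return the class whose prototype is closest."""
    def sq_dist(proto):
        return sum((f - p) ** 2 for f, p in zip(features, proto))
    return min(PROTOTYPES, key=lambda name: sq_dist(PROTOTYPES[name]))


def consensus_confidence(classify, stimulus, noise_sd=0.5, n_queries=50, seed=0):
    """Resubmit a noisy copy of the stimulus n_queries times.

    Returns the most frequent label and the fraction of queries that agreed
    with it, which serves as the confidence estimate.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_queries):
        noisy = [x + rng.gauss(0.0, noise_sd) for x in stimulus]
        votes[classify(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_queries


if __name__ == "__main__":
    for name, stimulus in [("clear", [3.0, 3.0]), ("ambiguous", [0.05, 0.05])]:
        label, conf = consensus_confidence(nearest_prototype, stimulus)
        print(f"{name} stimulus: {label} with consensus {conf:.2f}")
```

A stimulus far from the category boundary gets the same label on every noisy query, while one near the boundary flips back and forth, so its consensus score sits closer to one half.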
I forgot that I added these slides at the last minute. So just to remind you, some of these are measures that we’re extracting at the output layer. And to remind you of another effect, this is what’s done, I would say, in 90% of the cases where someone’s actually trying to extract confidence from a network like a convolutional neural network. We have the distance-based metrics that I told you about. And then we have the, see, I told you I was gonna change that term. I called it consensus before; here it’s perturbation. We can perturb the image with noise, or we can perturb the hidden layer by taking out units over repeated samples of that stimulus. And one theme across a couple of these approaches is the idea that we can use the robustness of a response as a measure of confidence. So KNN is a measure of robustness to its particular location in the high-dimensional hidden layer space that’s representing the object. Consensus over noise can be a measure of robustness, and so can consensus over hidden unit dropout, like we talked about. These are all techniques. What’s nice about them is that they can apply to any network structure. OK? They don’t require... so, an LLM has an output layer that’s huge, and it’ll be tough to use softmax at that. But this is something that will be easy to implement. OK. And I want to point out that this idea of using robustness, and particularly using perturbation robustness, came from an idea in psychology about how humans generate their confidence. The idea being that when they have a challenging query, they submit to themselves multiple versions of that query, and the consistency of the answer they come up with determines their confidence in that answer. And for those of you on the computer science side, you know that similar techniques have been used for a long time on the training side, to train networks to be more robust.
Okay, so here’s what the data are going to look like when they come out of this analysis. So this is what’s sometimes called the reliability diagram or confidence-accuracy plot. It’s the same kind of thing I showed you in the first sets of slides. We bin confidence, and then we look at accuracy within each one of those bins. This one’s pretty good. This one’s also not bad, but it’s a little overconfident. You can see that because these bars are mostly below the diagonal. We’re also gonna look at receiver operating characteristics. For those of you not familiar with them, that’s a means of taking a continuous output, as all of these are, and translating it into binary decisions by looking at all the different possible response policies, from maximally conservative response policies to maximally liberal ones. The area under that curve is considered to be a metric of the diagnosticity of that system. Okay. Okay. So now let me walk through the 4 slides. Don’t get hung up; I realize I’ve overloaded these slides with information. The goal is not for you to look at everything that’s on these slides. For the convolutional neural network, we have kind of a tie. Techniques that are based on Monte Carlo dropout are doing about as well as the output layer techniques, okay? That is on the ROC analysis, which tells us that the judgments are well ordered. When confidence is higher, performance is likely to be higher. When we look at the actual reliability diagrams, it’s a similar story. Though the output layer stuff is not doing quite as well, these are continuing to do well. The boxes, I should have said this before, I’m sorry: each one of the red boxes indicates that it’s one of the top 5 of all the simulations that we ran in terms of performance. I would say it’s fair to say that both output layer methods and consensus methods are doing well in the simple case, the convolutional neural network that’s only doing 10-alternative classification.
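For readers who want to see what the area under the ROC curve measures here, the sketch below computes it directly from confidence ratings, under the assumption that each trial is labeled simply as correct or incorrect. It uses the standard rank interpretation of AUC: the probability that a randomly chosen correct trial carries higher confidence than a randomly chosen incorrect one, with ties counted as one half. The function name `auroc` and the toy data are illustrative assumptions.

```python
def auroc(confidences, correct):
    """Area under the ROC curve for confidence as a predictor of correctness.

    Computed by comparing every (correct, incorrect) pair of trials, which
    is equivalent to sweeping all possible confidence thresholds.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical trials: confidence ratings and whether the label was right.
    well_ordered = auroc([0.9, 0.8, 0.7, 0.6], [True, True, False, False])
    uninformative = auroc([0.9, 0.6, 0.8, 0.7], [True, True, False, False])
    print("well ordered:", well_ordered)
    print("uninformative:", uninformative)
```

Note that this measures only ordering. A system can score well here while being badly overconfident in absolute terms, which is why the reliability diagram is examined separately.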
When we move to ResNet and the more complicated classification, the story changes a little bit. Both the output layer and the Monte Carlo dropout are doing well with respect to the ordering of their judgments. Okay? But when it comes to the actual judgments themselves, the reliability diagram for ResNet, the output layer techniques are falling apart. It’s hugely overconfident in both of these cases, as we can see. And by the way, overconfidence is not some sort of mathematical consequence of anything we’re doing here. There are other systems here where the system is also hugely underconfident. But what this tells us, I think I say this on a slide, is that measures that are based on robustness actually generalize better across different network architectures than the measures that are based at the output layer. And that’s, I think, an important lesson because, like I say, they’re not in use at all. I was a little confused by this because you changed two things at once. You had the convolutional neural network with 10 categories, and then ResNet with 100. So I’m not sure which is... Yeah, it’s a good point. For our first pass, we were just sort of trying to do complex architecture and complex decision versus simple architecture. But we’re running that now. We’re doing the crossover version as well. But I don’t know what the outcome of that will be. Sorry, I understand the AUC, the ROC curve, for binary classifiers. But what about these 10-category or 100-category classifiers? So we’re treating the binary classification as right versus wrong. So basically, if it’s a dog, you can have a true positive, where it was a dog and you detected a dog, or a false positive, where you’re calling something else a dog. Okay. I think I said all that. Okay. I don’t have a clock in front of me. When did we start? A little past 12:30. Okay. I’m wrapping up. I promise.
I know I’m past time, but this is the last section and it really brings the room together nicely, as Lebowski would say. So I wanna get through this, which is: what happens then if you can develop an AI agent that is calibrated? What are the consequences for its work with humans? And I’m actually gonna move to a fake agent in this one, just a simulated AI system that’s gonna work with a human on this particular task. This is a perceptual density estimation task. You look at this grid and you have to estimate what proportion of the little blocks are filled in black. I think this example is 50%, though I don’t actually remember it. I’m not very good at the task. The way this task works is the human sees the stimulus, and they make their judgment from 0 to 100%. They make a confidence assessment, though in this case, the confidence assessment isn’t continuous. It’s a verbal label. And then they get feedback from a fake agent: the agent gives them its own estimate, and it gives them a confidence assessment as to that estimate. The human then has the opportunity to revise their estimate after that. This is a typical judge-advisor system kind of paradigm. Now, the manipulation in this experiment is that the agent that provides the advice is either calibrated, that is, when it gives higher confidence, it’s more right, or uncalibrated: it has the same level of accuracy regardless of whether it’s expressing high or low confidence. And importantly, do I say this here? Yeah, the accuracy is the same for both these agents. So whatever differences we’re going to see, it’s not because one of these partners is more accurate than the other. It’s just because they convey better information in the content of their calibrated confidence judgments. Here’s human performance on the first judgment. This is a measure of error, so down is better. Okay, they do better from the first to the second judgment.
They’re learning from their agent and they’re doing better. They’re working with the agent as a partner. They do a little better in the calibrated condition than the random condition, which is our starting point for drawing the conclusion that calibrated agents make for better partners. But what I really wanna do is unpack this a little bit as to where that benefit comes from. So we’re gonna use here a couple of different measures. One is a standard measure in the decision-making literature called weight of advice, which scores basically what proportion of the distance you moved between your judgments, as a function of the distance between your original judgment and the agent’s judgment. Okay? So you could move all the way to the agent’s judgment, you could totally ignore the agent’s judgment, or you could move partway in between. And there are scores outside of that, but they’re not very common. Then, the distribution of scores that comes out of that has massive inflation at both 0 and 1. That is, people seem to be doing what’s called in statistics a dual hurdle model: they seem to be making a first choice about whether or not to ignore the agent, that is, to essentially reject the advice, and then about whether to accept it unequivocally or to integrate its advice. And when they integrate its advice, we try and capture the amount of weighting by a beta distribution on the appropriate interval. And what you can see is that when humans are working with a calibrated agent, they score well on all of these statistics. That is, they’re more likely to adopt the agent’s advice. They’re less likely to reject the agent’s advice. And even when they integrate, when they come up with some average of their own judgment and the agent’s judgment, they’re weighing the agent more heavily in that integration.
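The weight-of-advice score and the three response types described above can be written down compactly. The sketch below uses the usual definition, the shift from the first to the second judgment divided by the distance from the first judgment to the advice; the function names, the tolerance used to detect exact 0 and 1 scores, and the category labels are assumptions chosen for this illustration, and the sketch does not attempt the beta-distribution fit.

```python
def weight_of_advice(initial, advice, final):
    """Proportion of the distance toward the advice that the judge moved.

    0 means the advice was ignored, 1 means it was adopted outright.
    Returns None when advice equals the initial judgment (score undefined).
    """
    if advice == initial:
        return None
    return (final - initial) / (advice - initial)


def classify_response(woa, tol=1e-9):
    """Sort a weight-of-advice score into the response types from the talk."""
    if woa is None:
        return "undefined"
    if abs(woa) <= tol:
        return "reject"
    if abs(woa - 1.0) <= tol:
        return "adopt"
    if 0.0 < woa < 1.0:
        return "integrate"
    return "outside"  # moved away from the advice or overshot it; uncommon


if __name__ == "__main__":
    # Hypothetical trials: (first estimate, agent's estimate, revised estimate).
    trials = [(40, 60, 40), (40, 60, 60), (40, 60, 50), (40, 60, 70)]
    for initial, advice, final in trials:
        woa = weight_of_advice(initial, advice, final)
        print(initial, advice, final, "->", woa, classify_response(woa))
```

Splitting the scores this way is what exposes the pile-ups at 0 and 1, leaving only the “integrate” trials to be described by a continuous distribution between them.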
All of this tells us, I think I said this already, that humans are more willing to work with agents when those agents provide calibrated judgments, because those calibrated judgments tell you about when and when not to take the agent’s advice, and that’s useful. I think I’ll skip this because I know we’re running short on time. But for those of you who are interested, we have also developed a cognitive model of how people engage in this dual-hurdle procedure, which is... that is not a cognitive model, but this is. And then this will wrap up, I think, the things we have to say about it. One is that human group performance benefits from the exchange of metacognitive information, including confidence. Also, standard AI frameworks can use robustness to perturbation as a means of deriving calibrated confidence assessments. We saw that in the second part of the talk. And then finally, humans are more likely to make use of advice provided by a metacognitively sophisticated agent that provides calibrated judgments. I have one more slide though. You don’t have to pay a lot of attention to it, but I wanted to give AI the last word here. So I asked Claude, hey, can you provide a nice pithy summary statement for a talk about the metacognitive components of human-AI interaction? And, as it is wont to do, yeah, they’re kind of on point. Like, it’s not wrong. It says things like, “When humans and AI think together, the most important skill isn’t knowing what to ask, it’s knowing what you don’t know.” Yeah, that’s pretty good. “AI doesn’t just answer questions, it reshapes the questions we think to ask. And that’s where the real cognitive risk lives.” I didn’t quite get that one, but you know, like, it feels like English language. So I’m gonna give it some credit. But I followed up with another question. I asked it, how confident are you that these statements really capture the essence of what I’m asking for? And of course, now I’ve brought up confidence.
Now the LLM is going to follow some weird track in its space where it’s thinking about confidence. Like, oh, hold on a second: “Honestly, I’m moderately confident at best.” It doesn’t come across as moderately confident over here. It says, “I should have asked before generating.” But you didn’t. And you didn’t because you didn’t think of it, because you don’t have the capacity to reflect upon this unless I invite you to. Okay, that wraps it up for me. Thank you for your attention.