
Transcript


Arjun Guha: How Language Models Model Programming Languages & How Programmers Model Language Models

Today we’re pleased to have Arjun Guha here to speak with us. Arjun is a professor of computer science at Northeastern. If you look back at prior talks on the New York Tech Talks list on the website, you’ll see that Arjun was actually the first person outside the company to come give a tech talk here. That was way back in March of 2017, so it’s been eight years. It’s been a long time.

He’s got a wide range of interests that overlap with people at Jane Street. Back in 2017 when he was here, he was talking about verification of system configuration languages like Puppet. He’s also broadly interested in programming languages, systems, software engineering, mechanistic interpretability, and making LLMs better at programming tasks, especially on low-resource languages like OCaml. I see we’ve got a lot of people from the compiler team here, so this is going to be a good one for you.

Today he’ll be talking both about how LLMs reason internally as they work with programming languages and how humans have adapted their communication styles and working standards for working with large language models.

We’ll have some time for questions at the end, and I hope you enjoy it. Welcome back, Arjun.

Thank you so much, and thank you everyone for coming, and thanks to everyone who I met this afternoon. I’ve really enjoyed my time here so far. My background is in programming languages research. I was an OCaml hacker back in grad school. For the last five years or so, I’ve written more Python than one should, because I’ve been in the weeds with code LLMs. There are three major strands to my research. As you said, one is trying to make LLMs better at low-resource languages; this is a term of art for programming languages for which there is limited training data, such as OCaml. The second is understanding how people use LLMs, and the third is understanding how LLMs work under the hood as they do programming tasks. I’m going to try to tie these three threads together in this talk. But before I get into any of that, I want to talk about benchmarking. I want to begin by introducing some benchmarks that my group has built for LLMs doing tasks in low-resource languages such as OCaml. In this field, without benchmarks, you’re just taking shots in the dark. So, let’s talk about some benchmarks.

To understand our work, I think we need to turn back the clock to about the summer of 2022. You can think of that as after Copilot, but before ChatGPT. By the summer of 2022, the labs that were training language models had begun to standardize on how they evaluated models on coding tasks. OpenAI had released their Codex model, which powered GitHub Copilot, about a year earlier. That model’s paper also introduced a benchmark called HumanEval, and it had been rapidly adopted. There were also other benchmarks such as MBPP, very much along the same lines, and other labs, such as a group at Meta and a group at Salesforce, had all begun to adopt these benchmarks for evaluation.

HumanEval is a benchmark suite of 164 problems. The MBPP benchmark from Google has 400 or so problems, a little bit easier, but again along the same lines. The common theme is that each problem has a prompt and hidden test cases. Oh, and it’s all just Python. People were training these multilingual models, but they were being evaluated almost exclusively on their Python programming performance. So the natural question we asked is: how do they do at other languages?

What we did is take these benchmarks and develop a little suite of transpilers that translate them from Python into a whole suite of target languages. Given the Python prompt, we would turn it into, say, a Rust prompt, doing a little bit of type inference to figure out what the type annotations need to be, and also mechanically translating the test cases. Once we did that, we had a parallel suite of benchmarks across a whole bunch of programming languages. We call the benchmark MultiPL-E. It was the first large-scale multi-language evaluation of the models of the time.

So, here’s the state of multi-language benchmarking today. I pulled these numbers from the Kimi K2 paper, a very large open model that came out a couple of months ago. As you can see, the benchmark is getting pretty saturated: the best models are at or approaching 90%. And we know from others’ work that about 10% of the problems are faulty in various ways. This is a recurring theme: in any large enough benchmark, some of the problems are faulty in some way. So the benchmark truly is saturated, and there’s a need for new benchmarks.

We did some recent work we call Agnostics, where what we realized is that we can just give these models much harder tasks now. Instead of asking them to complete a function body, you can give them detailed instructions not just about the task but about exact input and output formatting. For example, you can take a problem from HumanEval and turn it into a problem that says, “Give me a complete program that takes input in the following format on standard in, and produces output in the following format on standard out.” Once you have a problem like that, you can just say, “Why don’t you solve this in OCaml?” As long as you can compile and run OCaml, you get a multi-language benchmark much more easily than we did before. In many cases, a capable LLM can do this transformation, turning a language-specific benchmark in Python into a language-agnostic one. If you do this to a benchmark, you get a multi-language benchmark. If you do it to training items, you get language-agnostic training items that you can use for reinforcement learning.
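The grading loop this enables can be sketched in a few lines. Everything here is illustrative: `check_io` and the toy candidate program are hypothetical stand-ins, not the actual Agnostics harness, and a real run would compile and invoke model-generated OCaml rather than a Python snippet.

```python
import subprocess
import sys
import tempfile
import textwrap

def check_io(cmd, stdin_text, expected_stdout):
    """Run a candidate program on the given stdin and compare its stdout
    to the expected output. The command could just as well invoke a
    compiled OCaml binary: anything you can compile and run works."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip() == expected_stdout.strip()

# Hypothetical candidate solution; in the real pipeline this would be
# model-generated code in the target language.
candidate = textwrap.dedent("""\
    nums = [int(x) for x in input().split()]
    print(sum(nums))
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(candidate)
    path = f.name

print(check_io([sys.executable, path], "1 2 3", "6"))
```

The point is that the only language-specific part left is the command you run, which is why the same benchmark items work across languages.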

We’ve done some recent work on reinforcement learning for code LLMs applied to low-resource languages. I have numbers here for OCaml and Fortran, but the paper has results for several other models and a couple of other programming languages as well. The question I really want to ask is: are there other baselines that we should include? I’m saying that we trained this little model and it does quite well in OCaml; it does really well in Fortran, it turns out. What I mean is that it does much better than the model we trained it up from. But are there other baselines that we should include?

To keep things simple for this talk, I’m going to talk about one little model that I trained up on OCaml. I don’t want to talk about Sonnet 4. I don’t want to talk about the Qwen 3 model, this weird hybrid thinking model. I’m going to talk about Qwen 2.5 Coder 3B Instruct. It’s a very vanilla little model, and actually a really strong model for its size. On a language-agnostic version of HumanEval, when I ask it to solve the problems in OCaml, the base model gets about 10% of the problems right, but the trained model gets 17% right. So this looks great: a significant improvement. But the question is, are there other baselines that we should include? And just to give you a sense of how hard this benchmark is, it’s still really easy for a frontier model. GPT-5 mini gets 72% of these problems right.

Before we get to other baselines, we’re going to take a little detour through the main technical part of this talk: trying to understand what is really going on inside these models. To answer this question, we’re going to work with the model I just introduced and a benchmark that is a language-agnostic version of HumanEval. The way I constructed it is I had Sonnet 4.5 translate the problems and verify that the tests were consistent with the translation.

What I’m going to first do is have the model solve these tasks with two variations of this data set. In one variation, the prompts will be prefixed with “write an OCaml program to solve this problem.” In the second variation, the prompts will be prefixed with “write a Python program to solve this problem.” So, we’re going to end up with two parallel datasets, one for which the model should generate an OCaml solution, one for which the model should generate a Python solution.

Qwen Coder 3B Instruct is a chat model. When you give it a prompt like “write hello world in OCaml,” what the model actually receives includes special tokens inserted into the text stream to clearly mark what is the user message and what is the response from the model. And what these models actually receive as input is not text. They receive tokens. What the model really sees is a stream of integers.

What I’m not going to do is the usual thing, which is to ask what the next predicted token is. What I’m going to do instead is ask what intermediate values the model produces as it does its computation. I want to give a high-level schematic, for people who haven’t seen this, of what a model is. Models are made of layers. When we feed a model a prompt such as “write FizzBuzz in OCaml,” you can think of the model as consisting of roughly three parts. The first part is the embedding layer, which maps the integers into high-dimensional vectors; the dimension is 2048 in the case of the particular model I’m working with. Then there are several transformer layers, the decoder blocks, which refine those vectors. And finally, there’s the unembedding layer, which maps from these vectors to a distribution over the space of tokens.
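That three-part picture can be sketched as a toy stand-in with random weights. Everything here is invented for illustration: the sizes are tiny (the real model has a vocabulary in the tens of thousands, dimension 2048, and 36 decoder blocks of attention plus MLP, not random linear maps), but the data flow of embed, refine, unembed is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers = 50, 16, 4     # toy sizes for illustration only

embed = rng.normal(size=(vocab, d_model))    # embedding: token id -> vector
unembed = rng.normal(size=(d_model, vocab))  # unembedding: vector -> logits

def transformer_layer(h):
    # Stand-in for a real decoder block (attention + MLP); here just a
    # random linear map, so only the residual data flow is realistic.
    W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    return h + h @ W   # residual connection: each layer refines the stream

tokens = np.array([3, 14, 15])     # a "prompt" as a stream of integers
h = embed[tokens]                  # (seq, d_model) residual stream
residual_stream = [h]
for _ in range(n_layers):
    h = transformer_layer(h)
    residual_stream.append(h)      # these are the activations we read off
logits = h @ unembed               # pre-softmax scores over next tokens
print(logits.shape)                # one score vector per position
```

The `residual_stream` list is exactly the kind of intermediate value the rest of the talk reads off and manipulates.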

We’re going to read off the intermediate values, also known as activations or the residual stream, that the model produces as it goes through the transformer layers. The way I’m going to do this is using another project developed at Northeastern called NNSight, along with NDIF, the National Deep Inference Facility. NNSight is a DSL embedded in Python that makes it easy to both read out models’ internal states and even manipulate them. NDIF is an NSF-sponsored hosted GPU service on which researchers in the US can query very, very large models.

We tokenize the prompt to get a stream of integers, feed the prompt through the model, and within NNSight’s with block, we can perform various manipulations. At every layer, we grab the layer’s output for the last five tokens. So we have lots of high-dimensional vectors: five per layer, across 36 layers, and that’s just for one prompt. And we have a dataset of 300 or so prompts: 150 for Python, 150 for OCaml. I want to look at these vectors, but looking at high-dimensional vectors is hard. The way I’m going to do it is by projecting them into 2D using principal component analysis (PCA). PCA is a linear method that learns a change of basis in high dimension, from the standard orthonormal basis into a new basis where the basis vectors are ranked by the amount of variation in the dataset that they capture.
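The projection step can be sketched with synthetic activations standing in for the real ones (the cluster shift, seed, and counts below are made up; the top two principal components are computed via SVD on centered data rather than a library call).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2048   # hidden size of the model in the talk

# Stand-ins for one layer's activations on the last prompt token: two
# clusters that differ slightly along one direction, like the
# Python-labelled and OCaml-labelled prompts.
python_acts = rng.normal(size=(150, d))
ocaml_acts = rng.normal(size=(150, d)) + 0.5

X = np.vstack([python_acts, ocaml_acts])
Xc = X - X.mean(axis=0)                 # center the data
# PCA: rows of Vt are the new basis vectors, ranked by captured variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T                    # each prompt is now a 2-D point
print(proj.shape)
```

With a real layer's activations, scatter-plotting `proj` is what produces the Python-vs-OCaml separation pictures.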

At the first layer of the model, the OCaml points and Python points are basically on top of each other. The prompts are basically identical — one says OCaml, one says Python. But as we go down the layers, we start seeing some separation. The model is actually encoding some information about the language in which it needs to do the task. This is dimensionality reduction. This could all be a mirage. We’re going to see that it’s not. But there is some sort of separation appearing between the prompts that say do it in Python and the prompts that say do it in OCaml.

So here’s the first experiment we’re going to do. The average Python prompt’s intermediate value is somewhere over there, and the average OCaml one is over there. What I’m going to do is compute that difference vector and add it in. I’m going to be stupidly lazy: I’m just going to do this vector addition at every single layer. This technique is called activation steering. It’s been used a bunch for natural language tasks, and we’re starting to see it for programming tasks. We compute the average intermediate values for Python, the average for OCaml, and compute this language-changing patch: start at Python, step away from Python, and move toward OCaml. That’s it. Minus Python, plus OCaml.
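The patch itself is a one-liner once you have the two averages. This is a detached-vector sketch with made-up values; in the real experiment the addition happens inside an NNSight trace, at every layer, on the activations of the final prompt tokens.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2048

# Per-layer mean activations over the two prompt sets (stand-ins for
# the real values read out of the model).
python_mean = rng.normal(size=d)
ocaml_mean = rng.normal(size=d)

# The language-changing patch: minus Python, plus OCaml.
steer = ocaml_mean - python_mean

def patched(h):
    """Add the steering vector to the residual stream at one layer.
    The lazy version in the talk applies this at every single layer."""
    return h + steer

h = rng.normal(size=d)               # some intermediate value
assert np.allclose(patched(h) - h, steer)
```

The whole intervention is just this addition, which is what makes it surprising that it changes the model's default output language.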

So what happens? I’m going to feed the model these prompts without any language-specific direction at all. We’ve basically changed the default programming language of the model from Python to OCaml. It doesn’t say “write this in OCaml,” but adding in that vector makes it generate OCaml.

I’m going to do a different kind of patching experiment next; that was just a warm-up. We need two parallel datasets. The two parallel datasets here are both going to be in OCaml (just the prompts, not the solutions): OCaml problems that the model solves correctly, and OCaml problems that it fails to solve. What I’m going to patch the model to do is step away from the intermediate values on the problems it solves incorrectly and step toward the intermediate values on the problems it solves correctly. This steering alone recovers a good chunk of performance. I’m relieved it didn’t reach as high as our trained model; that means our work wasn’t wasted. But the question to ask when you’re training up a model is whether there are other baselines you should consider, to figure out what it is you’ve really achieved. If you can get a bunch of performance just by doing this (you wouldn’t want to ship it, but clearly I’m not endowing the model with new knowledge), you have to ask: are you just aligning the model, or are you actually endowing it with new knowledge?

I also did some hand prompt engineering. In this case, I just told it what libraries I had installed; I said, “I don’t have any of the Jane Street public libraries installed.” Because OCaml really has two dialects, and when you’re evaluating the model, you’ve got to know: are you actually penalizing the fact that your environment doesn’t have the Jane Street libraries installed, or do you want to encourage the model to write in that style? You just have to understand what you want and make sure you’re measuring the right thing. My conclusion is that our RL effort was worthwhile on this model, and I think the results will hold for larger models as well. But this experiment narrowed the gap to the baseline.

I want to move on to a third application of activation steering, which I think is actually the most interesting — using activation steering to understand why models mispredict types. I do not mean OCaml-style type inference. I mean predicting type annotations in languages with explicit type annotations. I’m going to work with Python and TypeScript. The better your model is at the task, the cleaner results you get. Models are really good at Python and TypeScript. On a prompt where the variable is named N but it’s clearly being used as a string, many weaker models will just get that wrong and say, “Oh, N, integer.”

Again, we’re doing activation steering, so there are two parallel datasets. Dataset one is a combination of two: TypeScript programs from the Stack (a GitHub-derived dataset) that type check, and for Python, a dataset called ManyTypes4Py, Python programs with many type annotations. Both are programs that type check, using tsc or Pyright respectively. For the negative dataset, we apply a bunch of semantics-preserving edits to the positive examples until the LLM mispredicts: we take a program that type checks, rename the class Point to something like Type0, rename a variable from x to temp, and eventually start dropping type annotations. We apply this grab bag of semantics-preserving syntactic mutations until type prediction breaks. That breaking mutation, which is model-specific, becomes the negative dataset for the model.
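One such mutation, an identifier rename, can be sketched over Python ASTs. This is a toy, scope-naive version: `rename_identifiers` is a hypothetical helper, a real pipeline would need a scope-aware renamer (and a TypeScript counterpart), and it renames every matching name rather than respecting shadowing.

```python
import ast

def rename_identifiers(src, mapping):
    """Apply a semantics-preserving rename (e.g. x -> temp, Point -> Type0)
    by walking the AST and rewriting matching names in place."""
    tree = ast.parse(src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
        elif isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.ClassDef) and node.name in mapping:
            node.name = mapping[node.name]
    return ast.unparse(tree)

src = "def dist(x):\n    return x * x\n"
print(rename_identifiers(src, {"x": "temp"}))
```

Because the rename preserves semantics, any change in the model's type prediction afterward is purely a reaction to surface syntax, which is exactly what the negative dataset is meant to capture.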

We’re able to correct a whole bunch of type predictions using the same activation steering methodology. We step away from the average intermediate values for the prompts on which the model mispredicts types and towards the average for the prompts where the model gets it right. You get different results when you patch at different layers, but you get up to 60% accuracy on these type prediction tasks. Baseline is zero. These are all tasks where the model just gets it wrong, but we get 60% right with the patch.

But that’s not the most interesting result. In languages like TypeScript and Python, when I say the model “predicts the type correctly,” what I really mean is that it fills in the type the program had before. But in these gradually typed languages, there’s always a type you can write down that will make the type checker happy: any. There are cases where the unsteered model predicted a type that was not the expected one but was still okay; it predicted any, which is not the type we wanted, but is technically fine. We aren’t able to correct those. But when the base model predicts a type that was just wrong and would have led to a type error, we are able to correct those much more often. So what the steering is doing is not improving type precision; it really is correcting type errors, and those are two different things.

Finally, we have two sets of steering vectors: one for correcting Python type errors, another for correcting TypeScript type errors. There’s another question you can ask: what happens when you try to correct Python type errors with a TypeScript steering vector, or vice versa? What we find is that it’s just as effective either way. I think that’s pretty cool, because it suggests there’s some shared representation of types that the model has learned across at least these two languages, and, I’ll speculate, other languages as well. It’s a strong signal that that is the case, as opposed to the model representing them in entirely different ways internally.

Wrapping up, we’re at this point where we just have a surface-level understanding of what LLMs can and cannot do, and we’re just beginning to dig deeper with interpretability techniques. Most interpretability research is not on programming tasks, but the formal properties of code — the ability to do semantics-preserving code edits, actually test the generated code, run the type checker — they make programming languages a really good platform for studying model internals.

However, asking what task an LLM can do is actually a really narrow question. We also need to understand humans’ mental models of LLM capabilities to truly understand what is possible. That’s going to be the second and shorter part of my talk.

Before that, a quick interlude about more recent work from us; this is a point about benchmarks. What we normally want to ask is: can a model do some task X? The problem with benchmarks is that they really ask, “Can a model produce the correct answer on a prompt P?”, which is not quite the same thing. Here’s a prompt from a benchmark called ParEval, a benchmark for parallel programming. It has CUDA benchmarks, ROCm benchmarks; this one is an OpenMP prompt, and it’s a hard benchmark. But maybe the model will do a lot better if you add a little more detail to this very terse prompt. Something we did recently is mechanically dial up the detail until all benchmark problems pass with high reliability, and then dial the detail back down, using a model, to get prompts with less and less detail. We were able to generate nice curves showing that as you drive down the level of detail, performance on various models goes down in a predictable way.

I want to talk about prompts and people. There’s a lot of research in the space. According to my Sonnet summary of 60 paper titles from this year, about 20% of the papers are about LLMs in CS education. I’m part of the problem. I’ve been studying student-LLM interaction since ‘23. What’s interesting about our work is that it lends itself to interesting secondary analyses. Back in early ‘23, we did a study with 120 students who had all completed CS1 in Python and no other course. At Northeastern we used to have students learn Scheme; those students were excluded. So it’s just people who know Python.

What is early ‘23? ChatGPT had just been released. None of these students had used GitHub Copilot or ChatGPT. Many hadn’t even heard of ChatGPT. College students have better things to do than keep up on the latest tech news. So the high-level question is: how do students with zero LLM experience but basic programming knowledge do at prompting a state-of-the-art code LLM from ‘23?

The model is the largest OpenAI Codex model. Easily the best code LLM of the time. We had 120 students from three universities. When you ask a student who’s just done CS1 to use a model, you have to be careful what problems you give them. You can’t tell them to write a web server — they’ll say “what’s a server.” We were careful to pick trivial programming problems taken from their homeworks and exams. The task we gave them was only to write the natural language prompt. So we were focusing entirely on prompt-writing ability. We did not allow them to edit code. If the model produced wrong output they could give up, roll the dice again (models are non-deterministic), or revise the prompt. We would run unit tests for them, show them all unit test results. There were no hidden tests. This was a Zoom study, 60 minutes to do six tasks.

When you want to study student prompting ability you can’t tell them what the task is in English because they might just parrot it back. So we showed them test cases and the function signature and said, “Write a description of this function.” They hit submit. Codex thinks for a bit, produces code. We plugged it in as a docstring. We showed them expected output, actual output. They could try again or move on.

How do students do? Perhaps unsurprisingly, they don’t do super well. One question is: after infinite retries, how do they do? That’s the eventual success rate. It’s a wide distribution. Another question: if you count every failed attempt as a failure, how do they do? The success rate is much lower. And remember, this is with us giving perfect feedback — removing the fact that in real life you need to evaluate the model’s output yourself.

What’s more interesting is the dataset. We published this anonymized dataset of 2000+ student prompts with the full prompting trajectory for each student, including model-generated code, test results, and so on. One thing you can do with this dataset is turn it into a benchmark. We take the first and last prompt from each trajectory and get a benchmark with several prompts of varying quality per task. This benchmark is unique — in most benchmarks there’s one task and one prompt for the task. Since we have a sense of what are the good prompts (ones on which Codex succeeds) versus bad prompts (ones on which they eventually gave up), we can plot curves showing how various other models do when resampled on these tasks.

Looking at GPT-3.5: if you look at the prompts on which students succeeded, a bunch of them were actually really low-quality prompts; the students just got lucky. Perhaps worse, among the prompts on which students failed, 25% or so were actually really good prompts. Students just got unlucky and gave up; they could have rolled the dice again. Something else that’s clear: prompts where a student solved a problem in a single attempt tend to be more reliable than students’ last successful attempts after multiple iterations. The intuition is that if you write a great prompt, you solve it in one shot, whereas if you take multiple iterations, you end up dragging it on and adding more and more detail. There’s a thing I often see as a teacher: a student will write code, it won’t work, so they write more code, and it won’t work, and they keep writing code, when the thing to do is to say: no, stop, just throw away everything you have. It’s really hard to do. Students do the same thing with prompts, which is possibly worse than piling on code, because we know that if you add more and more context, at some point these models just randomly ignore what’s in the context.

One last thing. The other thing you can do with these prompt trajectories is ask: what makes student-written prompts unreliable? Let me introduce a problem from our study, the total bill problem. It’s a grocery bill: you multiply quantity by price, then add sales tax. You write a prompt to do this. During the study, 13 students attempted this problem, and many made multiple attempts. What’s hard is analyzing this natural language text; there’s a huge amount of variation in what people wrote. It took us a year and a half before we had a breakthrough in how to understand these things.

My grad student Francesca came up with this. The question we asked is: what is the essential set of facts (we call them clues) that gets added, removed, or updated in every attempt? Students are making ad hoc changes to prompts, but what is the actual information content of the prompts they’re changing? For any problem, we can come up with the set of facts, or clues, necessary to solve it correctly. For the total bill problem, we came up with eight of them based on analysis of the successful prompts: the inputs are lists, the list structure is explained, round to two decimal places, and so on. We label every edit by the delta it makes to the set of clues.
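The labeling itself is just set arithmetic over clue numbers. A minimal sketch, with made-up clue sets for illustration (which number maps to which fact is an assumption here, not the paper's actual coding scheme):

```python
def clue_delta(before: set, after: set) -> dict:
    """Label a prompt edit by how it changes the set of clues present."""
    return {"added": after - before, "removed": before - after}

# Hypothetical trajectory: clue 7 ("round to two decimal places") is
# missing throughout, and a reword accidentally drops clue 4.
attempt1 = {1, 2, 3, 4, 5, 6, 8}
attempt2 = {1, 2, 3, 4, 5, 6, 8}   # "tax" -> "taxes": no information change
attempt3 = {1, 2, 3, 5, 6, 8}      # reworded, but clue 4 got deleted

print(clue_delta(attempt1, attempt2))   # no-op edit
print(clue_delta(attempt2, attempt3))   # an edit that loses information
```

Collapsing free-form edits to these deltas is what makes the wildly varied natural-language prompts comparable across students.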

Let me walk you through one trajectory. A diamond represents a student making a change to their prompt. A circle represents the model generating code, our platform running tests, and presenting the results to the student, who then makes another edit. Green is a successful state. Reds are cases where students gave up. We cluster failures together; multiple students end up in the same state because it’s the same failure.

Student 23 looks at the initial problem description and writes their prompt. They’ve almost got it: they’ve added every single clue except clue number seven, which is that you’ve got to round the answer to two decimal places. They observe a failure. They make an edit: they change “tax” to “taxes.” That doesn’t change the information content at all. The model produces another result (it’s stochastic) and fails differently. The student observes this failure and makes an edit: they go back to “tax” and add “which is the last two components of the list,” but they already said that. They modify their description of the list structure. Then they’re back where they were before: they observe the same generation and the same failure. They do some rewording: they add clue four, but they actually delete the information about adding up the results. At this point, they give up.

It’s a shame they gave up, because they’re at a state from which most other students realize, in one more attempt, that they just have to add clue seven: round to two decimal places. Intro CS students don’t have the mental model of: oh, when I see a number with 500 decimal places, that’s that floating point thing; I’ve got to round. They haven’t actually learned floating point in any detail yet.

My takeaway is that students just don’t have a good mental model of what the model already knows or what the priors of the model are. We come up with findings such as: if all the clues are present, with high probability your prompt will solve the problem. If even one clue is missing, with high probability you will not solve the problem. If you get stuck in a cycle revisiting the same error state, with very high probability you will give up.

To conclude: everyone knows someone who claims LLMs make them two to four times more productive. I claim they make me significantly more productive. But we’re at a point where it’s not clear whether the randomly sampled developer gets a benefit from LLMs. If you’ve looked at controlled studies like the recent METR study, there’s evidence that in controlled environments, LLMs are actually slowing developers down. The METR study even includes some of the compiler hackers who work on NNSight and NDIF. That’s just a fact.

We’re at an interesting point. Coding agents are exploding in popularity. Here’s a graph from a recent paper. It’s really easy to mine GitHub commits by agents — they all say “co-authored by Claude Code” or have a link to a ChatGPT website. In the first 3 months or so since Claude Code was released, we gathered 1.3 million commits, and there’s many more in month four. There’s an enormous amount of data being generated by agents, and we’re beginning to look into how developers are actually using this in the wild in open source.

Q&A

Q: How do you see CS education changing with LLMs?

There’s probably a way to use them well, but I don’t think we know how to yet. I sincerely believe they can increase programmer productivity, but the goal when you’re in college is not to be productive. I’m not looking for 50 implementations of Pac-Man. When we assign a task like that, the goal is to get students to learn. When the model does the task for you, it’s not clear that one learns. Now, maybe the goal is to learn how to use a model — I teach a class where students learn how to use models better, and it’s fine when that is the task. But for most things people are learning, the model can short-circuit learning. We face an uphill battle.

Q: What was the students’ mental model of progress?

Great question. That was the first part of the paper, which I didn’t talk about. When you talk to students after they do the task, they say all sorts of things, but no one said “language model.” When you ask them why it’s hard, they’d suggest syntactic things and vocabulary issues. The first part of the paper does a causal intervention experiment where we take prompts with bad vocabulary and substitute good vocabulary. For example, since they’re writing Python, we’d take prompts that say “array” and change it to “list.” We find those interventions have no impact. Students don’t realize the problem is that they’re not conveying the right information; the finesse of grammar doesn’t really matter.

Q: Did you try inverting the patch in the type-correction work?

No, not in that paper. We do a random baseline — try to patch with random noise of the same magnitude. That has an effect in certain cases but not the pronounced effect of correcting type errors. It would be funny to try to introduce type errors, but we haven’t done that.

Q: What does it mean when the two groups are separated in PCA?

These are giant stacked classifiers. There’s some hyperplane being learned in high-dimensional space that the model is using to classify this set of prompts. I’m relying on my intuition that if I have two sets of prompts, one says do it in OCaml and the other says do it in Python, the model must be separating them some way. I’m finding the separation happens so consistently at every layer. It’s just classification. So it’s kind of like magic — as far as classification is magic, which I don’t think it is. We’re just exposing the fact that it’s a bunch of classifiers under the hood. Very complicated classifiers, but just classifiers.