

Arjun Guha: How Language Models Model Programming Languages & How Programmers Model Language Models

Jane Street · published 2025-12-17 · added 2026-04-11
llm interpretability programming-languages ocaml cs-education activation-steering code-generation jane-street


ELI5/TLDR

A computer science professor cracks open a language model like a watch, pokes around inside, and finds that the model secretly keeps a little mental note of which programming language it’s supposed to write in — and you can edit that note to make it switch languages mid-sentence. Then he pivots to something almost opposite: he watches a hundred college students try to get an AI to write simple programs, and discovers they fail not because their grammar is bad but because they forget to tell the AI obvious things, like “please round to two decimal places.” Both halves land on the same quiet punchline: we’re only just starting to understand what these models know, and we’re even further from understanding what we know about them.

The Full Story

Arjun Guha is a professor at Northeastern who splits his time between two very different questions about large language models and code: what’s actually going on inside the model’s head, and what’s going on inside the heads of the people prompting it. His talk at Jane Street is really two talks stapled together, and the seam is instructive.

The benchmarking problem, or: why nobody can tell if OCaml is getting better

Back in 2022 — the cliff-edge moment between GitHub Copilot and ChatGPT — every lab training a code model was using the same benchmark to measure progress. It was called HumanEval: 164 little problems where the model gets a function signature and a docstring and has to fill in the body. Simple stuff. The only catch: it was all Python. Every major lab was quietly training “multilingual” models and then grading them on a single language.

Guha’s group built a machine to mechanically translate these Python benchmarks into other languages — Rust, OCaml, a whole menu of them. Think of it like taking a standardized math test written in English and running it through a translator so you can give the same test to students in Portuguese or Swahili. Now you can actually compare apples to apples. They called it MultiPL-E, and it was the first large-scale test of how these models handle languages other than Python.
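To make the idea concrete, here is a toy sketch of that mechanical translation, vastly simpler than the real MultiPL-E compilers (the function name, the integers-only restriction, and the single-assertion scope are all invented for illustration):

```python
def py_test_to_ocaml(fname, args, expected):
    """Translate one Python-style unit test, e.g. `assert add(2, 3) == 5`,
    into an OCaml assertion. Toy version: integer arguments only; the real
    MultiPL-E system also translates signatures, docstrings, and types."""
    ocaml_args = " ".join(str(a) for a in args)
    # OCaml applies functions by juxtaposition and compares with `=`.
    return f"assert ({fname} {ocaml_args} = {expected})"

print(py_test_to_ocaml("add", [2, 3], 5))  # assert (add 2 3 = 5)
```

The hard parts the real system handles, and this sketch dodges, are exactly the interesting ones: mapping types, translating docstrings, and keeping the test semantics identical across languages.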

Fast forward to now: the benchmark is basically solved. Top models score around 90%, and researchers have since found that roughly 10% of the problems are broken anyway.

In any large enough benchmark, some of the problems are faulty in some sort of way.

Which is a nice reminder that the ruler you’re measuring with is usually slightly warped.

Low-resource languages and the baselines you forget to check

Guha’s group works a lot on OCaml — a language he loves from grad school and one of the “low-resource” languages in machine learning terms, meaning there just isn’t that much of it on the internet for models to learn from. Training a model to be better at OCaml is a reasonable thing to do, and they’ve done it. But he wants to make a deeper point first, and to make it he takes us on a detour into the guts of a transformer.

Here’s how to picture what he’s about to do. A language model takes words, turns them into numbers, shoves those numbers through a long assembly line of layers, and spits out a guess at the next word. Each layer takes the numbers from the previous layer and nudges them in a particular direction. Normally, researchers look at the final output — the guess. Guha instead looks at the intermediate states, the half-finished thoughts the model has as they pass through the assembly line. These are usually called “activations,” sometimes the “residual stream,” and you can think of them as snapshots of the model in the middle of thinking.

He takes two piles of nearly-identical prompts. Pile A says “write this in Python.” Pile B says “write this in OCaml.” Otherwise the two piles are the same task. He feeds both through the model and looks at the internal states at every layer. At the very first layer, the Python and OCaml prompts are basically on top of each other, which makes sense — only one word is different. But as you go deeper into the model, the two piles start to drift apart. By the middle layers, they’ve separated into two clear clusters. The model is, in some internal way, keeping track of which language it needs to write in.

Then Guha does something that feels like a magic trick. He takes the average “Python-ness” direction and the average “OCaml-ness” direction in this high-dimensional space, subtracts one from the other, and gets what you might call a “language vector.” Now he gives the model a prompt with no language specified at all — just the task — and adds this vector to the model’s internal state as it thinks. The model, which would normally default to Python, starts writing OCaml. He’s essentially flipped a switch the model didn’t know it had. The technique is called activation steering. People have used it on natural-language tasks before, but he’s showing it works cleanly on code too.
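In code, the vector arithmetic is almost embarrassingly simple. Here is a minimal numpy sketch with toy random “activations” standing in for a real model’s residual stream (the dimensions, the synthetic clusters, and the `steered_forward` helper are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden-state dimension; real models use thousands

# Stand-ins for mid-layer activations collected from two piles of prompts:
# pile A says "write this in Python", pile B says "write this in OCaml".
python_acts = rng.normal(size=(100, D))
ocaml_acts = rng.normal(size=(100, D)) + np.eye(D)[0] * 3.0  # drifted cluster

# The "language vector": difference of the two cluster means.
steer = ocaml_acts.mean(axis=0) - python_acts.mean(axis=0)

def steered_forward(hidden_state, alpha=1.0):
    """Patch the steering vector into a hidden state mid-computation."""
    return hidden_state + alpha * steer

# A neutral prompt's hidden state, nudged toward the OCaml cluster:
neutral = rng.normal(size=D)
patched = steered_forward(neutral)
```

In a real experiment the addition happens inside the model via a forward hook at one or more layers, but the arithmetic is exactly this: mean, subtract, add.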

It doesn’t say write this in OCaml, but adding in that vector makes it generate OCaml.

This is the warm-up. Here’s the punchline.

Guha trains a small model on OCaml and finds it improves from getting 10% of problems right to 17%. Respectable gain. But he wants to know: is this “real” improvement, or is he just pushing the model into an OCaml-shaped posture it already knew how to assume? So he does the activation steering trick again, but this time the two piles are “OCaml problems the untrained model got right” and “OCaml problems it got wrong.” He computes the difference vector — call it the “success direction” — and patches it into the model at every layer. No training required. Just vector addition. The patched model gets a chunk of the improvement for free.

He’s visibly relieved the trick doesn’t quite match his trained model. (A full match would have meant his training work was pointless.) But the gap has narrowed, and the lesson is uncomfortable: when you “train” a model to be better at something, you may not be teaching it new knowledge so much as aligning it with knowledge it already had.

Why models get types wrong, and what that tells us about types

The most interesting activation-steering experiment is the third one. Guha looks at type prediction — the task of guessing the right type annotation for a variable in Python or TypeScript. Models are mostly good at this, but they fail in predictable ways. If a variable is called n, the model will confidently label it an integer, even when the surrounding code clearly treats it as a string. The model is pattern-matching on the name and ignoring the evidence.
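A concrete instance of the failure mode (this toy function is my own, not one of Guha’s examples): the name `n` suggests `int`, but every operation on it demands `str`.

```python
def repeat_greeting(n):
    # Despite the name, `n` must be a string: it is concatenated and
    # upper-cased. A model pattern-matching on the name confidently
    # predicts `n: int`; the surrounding evidence says `n: str`.
    return ("Hello, " + n + "! ").upper() * 3
```

Passing an actual int raises a `TypeError` at the concatenation, which is exactly the evidence the name-driven prediction ignores.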

He builds two piles again: prompts the model gets right, and prompts the model gets wrong. (To build the “wrong” pile, he takes correct programs and mutates them without changing their meaning — renaming Point to TypeZero, renaming x to temp, things like that. He keeps mutating until the model starts mispredicting.) He computes the now-familiar steering vector and patches it in. He can correct up to 60% of the previously-wrong type predictions this way. Baseline is zero.
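The mutation step can be sketched with Python’s own `ast` module: a rename that provably preserves behavior (the `Rename` class and the example program are mine, not Guha’s actual tooling):

```python
import ast

class Rename(ast.NodeTransformer):
    """Rename identifiers without changing what the program computes."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def mutate(source, mapping):
    """Apply a semantics-preserving rename and return new source text."""
    return ast.unparse(Rename(mapping).visit(ast.parse(source)))

original = "def scale(x, factor):\n    return x * factor\n"
mutated = mutate(original, {"x": "temp", "factor": "t0"})
```

Because only names change, the two programs compute identical results; only the surface cues a model might lean on have moved.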

But then the real finding. He does this separately for Python and TypeScript, producing a Python steering vector and a TypeScript steering vector. What if you use the TypeScript vector to fix Python errors? It works just as well. The internal “direction” that corrects type errors isn’t language-specific — the model has learned some shared, cross-language representation of what “a type” is. Think of it like a bilingual person who, when they suddenly forget the right word in French, can find their way back to it by thinking in Italian. The model seems to have an underlying concept of type that lives above any specific language.

Pivoting to the people

Halfway through, Guha swaps topics. He wants to talk about the other half of the human-AI loop: the humans.

Back in early 2023, when ChatGPT was only weeks old and most college students hadn’t heard of it yet, Guha ran a study with 120 CS1 students. The setup: they had to write natural-language prompts — no code editing allowed — to get a state-of-the-art code model to solve tiny programming problems. The model gave feedback, the students iterated. Sixty minutes, six problems. How do they do?

Not great. Even with unlimited retries and perfect feedback, the success rates form a wide and disappointing distribution. But the real value is the dataset they built: 2,000+ prompt trajectories showing exactly how students thought their way toward (or away from) a working solution.

One analysis is sobering. Guha looks at which student-written prompts actually worked and which didn’t, then retests them on newer models. A bunch of the “successful” prompts only worked because the student got lucky on a non-deterministic model. A bunch of the “failed” prompts were actually fine, and the student just happened to roll snake-eyes and gave up. More tellingly: prompts where a student succeeded on the very first try were more reliable than prompts where they succeeded after many iterations. Why? Because when students iterate, they don’t fix the prompt — they pile more text onto it, hoping some of it sticks. This mirrors something Guha sees in normal programming: students who can’t get their code to work respond by writing more code.

The thing to do is like no, no, stop. Just throw away everything you have. It’s really hard to do.

His grad student Francesca came up with the sharper analytical frame. For any given task, there’s a set of essential facts — call them “clues” — that any working prompt needs to contain. For the “total bill” problem (multiply quantity by price, add sales tax, return a total), there are eight clues. Things like: the input is a list, the list has this structure, the answer needs to be rounded to two decimal places. You can then watch each student’s prompt-editing history and tag every edit by which clues got added, removed, or changed.
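For reference, a working solution to that task (the input structure is a guess, since the talk only specifies “the input is a list” and describes the clues informally):

```python
def total_bill(items, tax_rate):
    """items: list of (quantity, unit_price) pairs -- an assumed structure.
    Multiply quantity by price, add sales tax, and -- the clue students
    kept missing -- round to two decimal places."""
    subtotal = sum(qty * price for qty, price in items)
    return round(subtotal * (1 + tax_rate), 2)
```

Every clue in the prompt corresponds to a visible decision in the code: the list structure, the multiply, the tax, and the final `round(..., 2)` that Student 23 never asked for.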

The patterns that fall out are surprisingly clean. If all the clues are present, the prompt almost always works. Miss even one clue and it almost always fails. If a student ever revisits the same error state twice, they almost always give up shortly after. And most haunting is the portrait of Student 23, who got seven of the eight clues on the first try, then spent the rest of the session changing “tax” to “taxes,” rewording their description of a list, and deleting one bit of correct information while adding another — never once landing on the missing clue, which was “round to two decimal places.” They gave up one step away from the answer. Guha notes that a CS1 student doesn’t yet have the mental model to think “oh, floating point, I need to round” — they haven’t learned floating point in detail yet.
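Tagging an edit in this scheme is just set arithmetic over the clues present before and after (the clue names below are paraphrased from the talk, not Guha’s exact labels):

```python
def tag_edit(clues_before, clues_after):
    """Classify one prompt edit by the clues it added and removed."""
    return {"added": clues_after - clues_before,
            "removed": clues_before - clues_after}

# Student 23's whole session, in miniature: rewording edits that never
# touch the one missing clue ("round-to-two-decimals" never appears).
seven_clues = {"input-is-list", "list-structure", "multiply-qty-by-price",
               "add-sales-tax", "return-a-total", "tax-is-a-rate",
               "per-item-prices"}
reworded = tag_edit(seven_clues, seven_clues)  # {'added': set(), 'removed': set()}
```

An edit that tags as empty on both sides is exactly the “changing ‘tax’ to ‘taxes’” move: activity without information.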

The broader takeaway is that students fail not because their grammar is wrong but because they don’t know what the model already knows. They think they’re being misunderstood when they’re actually being under-informed.

The shape of things now

Guha closes with a gesture at the current moment. Controlled studies like the recent METR study (he calls it “the META study”) suggest LLMs can actually slow down experienced developers, at least under experimental conditions. Meanwhile, coding agents like Claude Code are producing industrial volumes of commits — his group mined 1.3 million in the first three months after Claude Code launched, with more every month. There’s an enormous amount of data being generated by agents working in the wild. We’re just starting to look at what it means.

Claude’s Take

This is a good talk, and the two halves are stronger together than apart, though it takes a second to see why. The first half is about the model’s internal representations being more legible and more manipulable than we might have assumed. The second half is about the human’s internal representations of the model being worse than we might have assumed. Both halves are arguing that the naive question “what can an LLM do” is incomplete, because the real loop is a feedback system between two minds that don’t model each other very well.

The activation-steering results are solid and fit into a respectable mechanistic interpretability literature. The strongest finding — that Python and TypeScript type-correction vectors are interchangeable — is a genuine contribution, not just a parlor trick. It’s also the kind of result that needs independent replication before you lean on it hard. Two languages with closely overlapping communities of users, both gradually-typed, both in the training data in massive quantities, may share representations in ways that don’t generalize to, say, Haskell and C++. Guha is careful to say he’ll “speculate” about other languages, and that hedge is appropriate.

The baseline point — that activation steering can recover a chunk of the improvement you thought you got from RL fine-tuning — is the most important thing in the first half, and it’s the kind of result that should worry anyone publishing fine-tuning papers without checking it. It doesn’t mean fine-tuning is fake. It means we don’t always know what we’re actually buying when we fine-tune.

The student study is less dramatic but arguably more useful in the long run. The “clues” framework is a clean analytical tool and the findings pass the smell test: anyone who has watched a non-specialist try to get an LLM to do something has seen the “pile more words on it” failure mode. The Student 23 story is damning not because the student was bad at prompting but because they were one trivial edit away from success and didn’t know it. That’s a learning-about-the-model failure, not a prompting failure. It’s also unfalsifiable in a particular way — you can always claim, after the fact, that a failed prompt was missing a “clue,” because the clues were defined partly by looking at which prompts worked. The framework is still useful as a descriptive lens, just don’t mistake it for a predictive theory.

One thing worth noting: the student study is from early 2023, on Codex. The LLM landscape has changed enough since then that some findings may not transfer. Today’s models forgive missing context more readily than Codex did. The deep finding — that novice prompters don’t have accurate models of what the model knows — will probably survive generations of model upgrades. The specific numbers probably won’t.

The bit about the METR study showing LLMs slowing down experienced developers is the one claim I’d fact-check before repeating. Guha references it almost in passing, and the original study is more nuanced than “LLMs slow you down” — the developers in question were experienced on large open-source codebases they knew deeply, which is precisely the regime where LLM assistance helps least. In other settings the effect goes the other way. Guha knows this, I think; the talk just doesn’t have time to unpack it.

Overall: an honest researcher showing his work, not overclaiming, occasionally revealing more than he means to. The most interesting line in the whole talk is the one where he admits he was “relieved” his activation-steering trick didn’t match his trained model. It’s a small moment of scientific self-awareness that most conference talks don’t contain.