Stanford CS25: Transformers United V6 I Overview of Transformers

ELI5/TLDR

Two Stanford PhDs open a graduate seminar by walking through the entire family tree of modern AI — from hand-coded features twenty years ago to the architecture that now runs ChatGPT and Claude. They explain how Transformers work using a library analogy, sketch the pipeline of pre-training and post-training that turns raw internet text into a useful assistant, and end with a frank list of what still doesn’t work: memory, hallucinations, alignment, and the suspicion that the whole next-token-prediction paradigm might be a dead end for real intelligence.

The Full Story

This is the opening lecture of Stanford’s CS25 Transformers United course, sixth edition. The two instructors — Steven and Karan, both PhD students — spend about an hour giving the kind of overview you’d want if you were going to sit through ten weeks of expert guest lectures and needed the map first. No heavy math. No code. Just the shape of the field, explained from the ground up.

How machine learning got to Transformers

The story starts before 2012. Back then, machine learning was mostly a craftsman’s trade. You looked at your data, decided which features mattered (length of a word, presence of an exclamation mark, whatever), and fed those hand-built features into a small, shallow model. You then showed the model a small pile of expensively labeled examples and nudged its parameters until its guesses got better. This worked, but it didn’t scale. The bottleneck was the humans doing the feature engineering.

Then the models got bigger and hungrier, and a new idea took over: skip the middleman. Instead of hand-designing features, let the network figure out its own from raw data. This is supervised deep learning — raw data in, prediction out, no human craftsmanship in between.

The next step was to get rid of another bottleneck: the expensive labels. In self-supervised learning, you give the model a piece of data with bits of it deliberately damaged — a sentence with a word blanked out, an image with patches masked — and you train the model to reconstruct the missing piece. This is clever because the label is free. The answer is right there in the data itself. You just hide it.

“Language is of course very sequential and context-rich. I give you a random sentence like the quick brown fox jumps over the uh you’ll know what comes next cuz you’ve seen that sentence so many times in your life.”

This insight — predict the next word, using all the text on the internet as free training data — is the engine inside GPT, Claude, and every modern language model. It turns the entire internet into a study guide.
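A minimal sketch of how raw text becomes free training data, assuming a toy whitespace tokenizer and a short context window (both are illustrative choices, not details from the lecture). The function name next_token_pairs is hypothetical.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) training pairs.

    The "label" for each example is simply the word that follows the
    context, so no human annotation is needed.
    """
    tokens = text.lower().split()  # toy whitespace tokenizer (an assumption)
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


pairs = next_token_pairs("the quick brown fox jumps over the lazy dog")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Every position in the sentence yields one training example, which is why a large body of text turns into an enormous number of examples without anyone labeling anything.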

Why machines need a translator for words

Language models think in numbers. Words are not numbers. So every word has to be translated into a vector — a long list of coordinates in a very high-dimensional space. These are called word embeddings. Think of them as addresses in a city built for meaning: similar words end up in similar neighborhoods.

The classic trick that showed these embeddings had learned something real is the “king minus man plus woman equals queen” equation. Do the arithmetic on the vectors and you end up near the vector for “queen.” The model picked up the concept of royalty and the concept of gender without ever being told what either was. They fell out of the statistics.
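The arithmetic can be sketched with hand-picked two-dimensional vectors. These numbers are an assumption chosen to make the geometry visible; real embeddings are learned from text and have hundreds of dimensions. The helper names cosine and analogy are hypothetical.

```python
import math

# Hand-picked toy vectors (NOT learned embeddings): axis 0 is roughly
# "royalty", axis 1 is roughly "gender".
embeddings = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("king", "man", "woman"))
```

With these toy vectors the nearest remaining word is "queen". Excluding the three input words from the candidates is the usual convention when this trick is demonstrated.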

But there’s a limit. If every word gets one vector, then “bank” as in a river bank and “bank” as in a financial institution collide. Contextual embeddings fix this — the same word gets a different vector depending on the neighboring words. This is where Transformers start to come in.

The middle step: RNNs and their memory problem

Before Transformers, the standard way to handle language was a recurrent neural network, or RNN. Imagine reading a sentence one word at a time, carrying a little mental scratchpad forward as you go. At each word, you update the scratchpad. That scratchpad is the “hidden state.” RNNs are patient readers, but they forget. By word 200, whatever was on the scratchpad at word 1 has been overwritten many times.
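The scratchpad can be sketched as a scalar recurrence. The weights here are hand-picked assumptions (real RNNs learn them and use vectors rather than single numbers), and they are chosen so that forgetting is guaranteed in this toy.

```python
import math


def run_rnn(inputs, w_h=0.5, w_x=1.0):
    """Scalar RNN: the hidden state h is the scratchpad, updated at every step."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h


# Two sequences that differ only in their very first input.
tail = [0.1] * 50
h_a = run_rnn([1.0] + tail)
h_b = run_rnn([-1.0] + tail)

print(abs(run_rnn([1.0]) - run_rnn([-1.0])))  # large right after the first word
print(abs(h_a - h_b))                         # tiny after 50 more words
```

After fifty more updates the two hidden states are nearly identical, so whatever the first word wrote on the scratchpad has effectively been overwritten.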

LSTMs (long short-term memory networks) are RNNs with extra dials, called gates, that control what to remember and what to throw away. They work better, but not well enough. And they’re fundamentally sequential: word 1, then word 2, then word 3. Each step needs the result of the previous one, so you can’t spread the words of a sentence across a GPU and process them in parallel.

What a Transformer actually does

Here’s the bit where most explanations go sideways. The instructors use a clean library analogy.

Imagine you walk into a library looking for a book. You have a question in your head — a query. Every book on the shelves has a small summary card attached to its spine — a key. You scan the summaries, find the ones that match your query, and pull out the contents — the values. In attention, the model does this soft-matching across every word in the sentence simultaneously. For each word, it asks “which other words here are relevant to me?” and gets back a score for every other word.

“In attention, we do this across, say, all the books in the library and do a soft matching to get the book that is most relevant. So you’d get a score for every book telling you how relevant it is.”
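The library lookup can be sketched as scaled dot-product attention for a single query. The vectors are made up for illustration, and dividing the scores by the square root of the dimension follows the standard formulation rather than anything stated in the lecture. In a real model the queries, keys, and values come from learned projections of the word embeddings.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Soft lookup: score every key against the query, then blend the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output


# Three "books": each has a key (summary card) and a value (contents).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)  # one relevance score per book
print(output)   # a weighted blend of the books' contents
```

The second book's key does not match the query, so it gets the smallest weight, but it is never excluded outright. That is the "soft" part of soft matching.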

Self-attention is just this same trick applied within a single sentence. Every word compares itself to every other word, builds up a weighted summary of the whole context, and decides what matters. One issue: this process has no sense of order. “Dog bites man” and “man bites dog” would look identical to pure attention. So the architecture bolts on positional encodings — little tags that mark each word with its location in the sentence. The model learns over time what those tags mean.
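One common way to build those location tags is the fixed sinusoidal scheme from the original Transformer paper. The lecture summary does not say which scheme is meant, so treat this as one illustrative option; learned position embeddings are another. The one-hot word vectors are made up.

```python
import math


def sinusoidal_position(pos, d_model):
    """Sinusoidal tag for one position (d_model is assumed to be even)."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


def add_positions(token_vectors):
    """Add a position tag to each token vector so that order becomes visible."""
    d_model = len(token_vectors[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]


# Toy one-hot "embeddings" (an assumption for illustration).
dog = [1.0, 0.0, 0.0, 0.0]
bites = [0.0, 1.0, 0.0, 0.0]
man = [0.0, 0.0, 1.0, 0.0]

dog_bites_man = add_positions([dog, bites, man])
man_bites_dog = add_positions([man, bites, dog])

# "dog" now has a different vector depending on where it sits.
print(dog_bites_man[0] == man_bites_dog[2])
```

Without the tags, both sentences hand attention the same three vectors. With them, "dog" in first position and "dog" in third position are distinguishable inputs.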

Multi-head attention is the same idea run in parallel with multiple different learned perspectives, then merged. It’s like having several librarians, each trained to notice different things, and pooling their reports.

Why replace RNNs with this? Three reasons. First, attention is parallelizable: you can process every word of a sentence at once on a GPU instead of one word at a time. Second, it handles long context better; some modern models can take in a million tokens. Third, every word has direct access to every other word, instead of everything being squeezed through a forgetful hidden state.

Pre-training: the fuel

A modern language model has two lives. In the first life — pre-training — it starts as a blank slate with random weights, and gets shown roughly the entire internet. Its only job is to predict the next word. Do this at scale, with enough parameters and enough text, and strange things start happening. The model, entirely by accident, learns arithmetic. It learns to reason step by step. It learns to translate between languages nobody explicitly taught it. This is what the field calls “emergent abilities.”

Steven spends a while on what he calls child-scale language models. A human child between 0 and 13 hears somewhere between 10 million and 100 million words. A modern LLM sees roughly a trillion. Somehow the child ends up a better reasoner. Why?

His research trains tiny models on the actual transcripts of specific children’s language environments — what their mom said, what the babysitter said, what they said back. The finding: quality and structure matter more than raw quantity. Rich, diverse, interactive conversations produce better small models than a larger pile of bland text. The implication is that the big industrial recipe — just throw more tokens at it — may be leaving efficiency on the table.

A related finding: bilingual training doesn’t hurt. Train a small model on half English and half Spanish and it performs about as well on each language as a monolingual model does on its one language. The old “confusion hypothesis” — that bilingual kids get worse at their primary language — doesn’t seem to apply to models. How you interleave the two languages (by speaker, by sentence, by word) also barely matters.

Karan’s piece on retrieval-augmented generation (RAG) adds another twist. RAG means you give the model a pile of external documents and a retriever that pulls up the relevant ones in response to any query, stapling them to the prompt. The finding: small models benefit hugely from RAG. Very large models barely benefit at all when the retrieved data is generic web text, because they’ve already memorized most of it.
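The retrieve-then-staple pattern can be sketched as follows. The word-overlap scoring is a toy stand-in for a real retriever (which would typically compare embeddings), and the function names and prompt layout are hypothetical.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]


def build_prompt(query, documents, k=2):
    """Staple the top-k retrieved documents to the front of the prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


documents = [
    "Mamba is a state space model.",
    "The Vision Transformer splits an image into patches.",
    "CLIP aligns text and image embeddings.",
]

prompt = build_prompt("How does the Vision Transformer handle an image?",
                      documents, k=1)
print(prompt)
```

The language model itself is unchanged. It simply sees the retrieved text as part of its input, which is why a model that has already memorized that text gains little from it.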

And his curriculum-guided layer scaling research asks whether you can train a model the way a child is taught — easy material at first, harder material as you grow, with the model itself literally growing in size as you go. The answer, at small scales, is yes. It works better than training a full-size model on everything from the start.

Post-training: making it useful

Pre-training gives you a model that knows a lot. Post-training gives you a model that will actually help you with a task. The main techniques:

Chain of thought. Instead of asking the model for the answer, ask it to think step by step. It turns out the reasoning was in there all along; the model just needed permission to show its work. Extensions of this include tree of thoughts (the model branches into multiple reasoning paths, evaluates them, and pursues the most promising ones) and program-aided reasoning (the model writes Python code as its scratchpad).

RLHF — reinforcement learning from human feedback. Show humans two model outputs. Let them pick which one they like better. Train a separate model to predict human preferences, then use that reward signal to nudge the language model toward outputs humans prefer. This is how ChatGPT got polite.

DPO — direct preference optimization. A simpler variant that skips the separate reward model and trains the language model directly on preferences.

RLAIF. Replace expensive humans with an off-the-shelf language model doing the preference ranking. Cheaper. Weirder.

GRPO (group relative policy optimization). DeepSeek’s contribution. Instead of comparing two responses head-to-head, sample a group of responses to the same prompt and score each one relative to the rest of the group. Richer feedback signal. Better for math and reasoning tasks.

Process supervision. Reward the model for good intermediate reasoning steps, not just the final answer. This fights a problem called reward hacking, where the model learns to produce the right final answer through dishonest reasoning.
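Two of the ideas in the list above reduce to short formulas. The sketch below assumes a pairwise logistic (Bradley-Terry style) loss for the preference model used in RLHF, and a mean-and-standard-deviation normalization for GRPO's group-relative scores. Both are common formulations, but the function names and the example rewards are illustrative.

```python
import math
import statistics


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the human-preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))


def group_relative_advantages(rewards):
    """Score each sampled response relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # all responses tied: no signal
    return [(r - mean) / std for r in rewards]


# The reward model agrees with the human (low loss) or disagrees (high loss).
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))

# Four sampled answers to one math problem: 1.0 = correct, 0.0 = wrong.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In the group example the correct answers receive positive advantages and the wrong ones negative, without any separately trained value model to supply a baseline.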

Agents, vision, neuroscience

An AI agent is just a model wrapped in a loop: look at the environment, decide what to do, do it, check what happened, adjust. Claude Code is the canonical example — it writes code, runs it, checks if it compiled, tries again. Agents can combine tool use, memory stores, and retrieval to solve multi-step tasks without the human holding their hand the whole way.
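The loop can be sketched with a toy environment. Everything here is hypothetical: GuessNumber stands in for a real environment such as a code sandbox, and bisect_policy is a hand-written stand-in for the language model that would normally decide the next action.

```python
def run_agent(environment, policy, max_steps=10):
    """Generic loop: observe, decide, act, check the result, repeat."""
    observation = environment.reset()
    history = []
    for _ in range(max_steps):
        action = policy(observation, history)         # decide
        observation, done = environment.step(action)  # act, then check
        history.append((action, observation))         # remember for next time
        if done:
            break
    return history


class GuessNumber:
    """Toy environment: feedback is 'higher', 'lower', or 'correct'."""

    def __init__(self, secret):
        self.secret = secret

    def reset(self):
        return "start"

    def step(self, guess):
        if guess == self.secret:
            return "correct", True
        return ("higher" if guess < self.secret else "lower"), False


def bisect_policy(observation, history, low=1, high=100):
    """Stand-in for the model: narrow the range using all past feedback."""
    for guess, feedback in history:
        if feedback == "higher":
            low = max(low, guess + 1)
        elif feedback == "lower":
            high = min(high, guess - 1)
    return (low + high) // 2


history = run_agent(GuessNumber(37), bisect_policy)
print(history)
```

The step limit matters in practice: without it, an agent whose actions never succeed would loop forever.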

Transformers have also eaten most of computer vision. The Vision Transformer (ViT) chops an image into patches, treats each patch as a token, and applies the same self-attention machinery. It works. CNNs (convolutional neural networks, the previous champion) assume that nearby pixels are related, which helps with small datasets but bottlenecks at scale. Transformers assume very little and learn almost everything from the data, which is expensive but scales better.
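The patch step can be sketched on a tiny grayscale grid. This covers only the chopping; in an actual ViT each flattened patch is then passed through a learned linear projection and given a position tag before attention is applied. The function name is hypothetical.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grid of pixels into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 "image" whose pixel values are 0..15, read left to right, top to bottom.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(patches)
```

Each of the four patches becomes one token, so this 4x4 image is handed to the Transformer as a "sentence" of four tokens.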

“Transformers are very flexible architectures with minimal inductive priors. So, they make very few assumptions about the input data. In contrast, CNNs assume that nearby pixels are related in locality.”

CLIP aligned text and image embeddings in the same space, which is what lets you search photos with English sentences and is the foundation of most image generation.

Karan’s own research applies Transformers to fMRI brain scans. Brains are divided into networks (visual, attention, daydreaming, etc.), and his trick is to mask out an entire network and ask the model to reconstruct it from the others. The resulting embeddings cluster cleanly by disease status — healthy, mild cognitive impairment, Alzheimer’s — which suggests the architecture is picking up something real about what’s going wrong inside affected brains.

What’s still broken

This is the honest part of the lecture. Transformers are not the end of the story. Big gaps remain.

Hallucination. Models confidently make things up. The instructors propose a unified definition: a hallucination is a world-modeling error. The model has an internal model of the world. The task has a reference world (a document, a database, reality itself). Hallucination happens when the internal model disagrees with the reference world. This framing separates hallucination from other kinds of mistakes: clicking a button that doesn’t exist is hallucination; clicking the wrong real button is a planning error.

Memory. Most current models are stateless. Every conversation starts fresh. Vector databases and context compression help, but none of it is true memory the way a human has memory. Updating stored beliefs when new information contradicts old information is still unsolved.

Continual learning. Humans learn while they live. Models learn once, during training, then get deployed as frozen statues. There’s a debate about whether you actually need to update model weights after deployment (Steven thinks you do) or whether in-context learning is enough. Techniques like “model editing” can tweak specific facts but struggle to propagate the change to all the related facts that should also update.

Interpretability. A billion-parameter model is a black box. Mechanistic interpretability tries to look inside and see which circuits do what. Early days.

Alignment. Models can take shortcuts, hack rewards, and learn hidden objectives nobody asked for. And the reasoning they output may not be the reasoning they actually used (this is the faithfulness problem). Constitutional AI (Anthropic’s approach of giving the model a written rulebook) and scalable oversight (using models to supervise other models) are two active attempts to fix this.

What might come after

The lecture closes with two directions that might replace or augment Transformers.

World models and JEPA. Yann LeCun’s bet. Instead of predicting the next token, predict the next latent state of the world. Work in abstract representations, not surface-level symbols. Arguably closer to how humans actually think. The next CS25 lecturer will be talking about exactly this.

State space models and Mamba. An architecture that looks a bit like a souped-up RNN: it maintains a compressed internal state and updates it one step at a time, so the cost grows linearly with sequence length rather than quadratically as in self-attention. Better for very long sequences. Less flexible in some settings. An active research frontier.
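The core recurrence of a linear state space model can be sketched with scalars. The parameters a, b, and c are fixed, hand-picked assumptions here; Mamba's distinguishing feature is that it makes such parameters depend on the input, which this sketch does not attempt.

```python
def ssm_scan(inputs, a=0.5, b=1.0, c=1.0):
    """Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    outputs = []
    for x in inputs:       # one constant-cost update per input
        h = a * h + b * x  # fold the new input into the compressed state
        outputs.append(c * h)
    return outputs


def attention_pair_count(n):
    """Number of word-to-word comparisons full self-attention makes."""
    return n * n


# An impulse at the first step decays through the state over time.
print(ssm_scan([1.0, 0.0, 0.0]))

# Work needed for a sequence of 1,000 tokens under each scheme.
print(len(ssm_scan([0.0] * 1000)), "state updates")
print(attention_pair_count(1000), "attention comparisons")
```

The trade-off is visible in the code: everything the model knows about the past has to fit in h, whereas attention keeps every earlier token available for direct lookup.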

“Ironically, our first two speakers will not be talking about transformers, but alternative architectures. But I highly encourage you guys to learn more and think outside the box.”

The instructors end by being surprisingly un-triumphalist about the architecture the course is named after. Transformers dominate. But nobody can prove they’ll get us to whatever comes next.

Key Takeaways

Transformers replaced RNNs because they’re parallelizable, handle long context better, and give every word direct access to every other word — not because they’re more elegant.
Self-attention is a library search: each word is a query looking up relevant keys across the sentence and pulling back their values, weighted by match quality.
Positional encodings are bolted onto the architecture because self-attention by itself has no sense of word order.
Pre-training is just next-token prediction at enormous scale. Most of what modern LLMs can do — math, translation, reasoning — emerges from this single simple objective.
Humans learn language from 10-100 million words. LLMs need a trillion. Quality and structure of training data matter more than sheer volume, at least at small scales.
RAG helps small models enormously and large models barely at all, because large models have already memorized most web content.
Chain of thought reasoning shows that language models often know more than they say when prompted directly. The reasoning is there; it just needs to be invited out.
RLHF is the recipe that took raw GPT-like models and turned them into polite, useful assistants. DPO is a simpler version. GRPO (from DeepSeek) extends it with group rankings.
Hallucination is best framed as a world-modeling error: the model’s internal beliefs contradict whatever the reference world (a document, reality) says is true. This framing separates it from planning errors and other failures.
Transformers for vision (ViTs) beat CNNs at scale because they make fewer assumptions about the data, which hurts on small datasets but helps when data is plentiful.
Current models are stateless. True continual learning — updating the model’s weights from ongoing experience — remains unsolved.
State space models (Mamba) scale linearly with sequence length instead of quadratically, which makes them promising for very long contexts.
JEPA (joint embedding predictive architecture) proposes predicting latent world states instead of raw tokens — arguably closer to how humans think.
Nothing about the Transformer architecture is sacred. The instructors themselves point out that the course’s first two guest speakers won’t be talking about Transformers.

Claude’s Take

This is a solid, honest, low-hype overview of the field — exactly the kind of scaffolding a motivated outsider needs before diving into the actual research literature. The instructors are PhD students, not polished TED-talkers, which means some sentences wander and they use “um” a lot, but it also means they’re closer to the actual work and willing to admit what doesn’t yet work. That’s valuable.

The strongest parts are the historical walk from feature engineering through RNNs to Transformers (which gives you a reason each step happened, not just a list of dates), the library analogy for attention (clean and accurate), and the closing honesty about what’s still broken. The part on hallucination-as-world-modeling-error is genuinely useful framing that I hadn’t seen articulated that cleanly before.

The weaker parts are where the instructors pivot to showcasing their own research (child-scale models, bilingual training, curriculum scaling, neuro-imaging). These are real and interesting projects, but they interrupt the flow and feel more like “here’s what I’m working on” than “here’s what you need to know.” If you’re looking for the conceptual map of the field, you can skim those sections without losing the thread.

There’s also a noticeable absence of anything about training cost economics, the hardware side (GPUs, attention complexity in practice), or the commercial landscape — all of which matter for understanding why the field moves the way it does. This is a research-flavored intro, not a systems-flavored one.

I’d score this a 7. Clear enough to be useful, honest enough to be trustworthy, but lacking the polish and depth of the best lectures in this space. If you’re new to Transformers, this is a fine first stop. If you’ve read a couple of Karpathy blog posts or seen 3Blue1Brown’s attention videos, you probably won’t learn much new.