Andrej Karpathy: From Vibe Coding to Agentic Engineering

ELI5 / TLDR

Karpathy — ex-Tesla AI director, OpenAI founding member, the guy who wrote nanoGPT and has been the field’s clearest narrator on LLMs for years — sat down with Sequoia and basically said: something flipped in December. The agentic coding loop crossed a threshold where he stopped editing the model’s output and just kept asking for more. He coined “vibe coding” last year for the casual version of this. Now he’s naming the serious cousin: agentic engineering, which is about preserving the old quality bar while routing real work through these spiky, untrustworthy, very fast interns. The other big idea: LLMs aren’t just faster software, they’re a new computing paradigm — Software 3.0, where the context window is your program and the model is the interpreter. He thinks human taste, spec-design, and understanding are now the bottleneck, not typing speed.

The Full Story

The December flip

The talk opens with a Karpathy line that’s been circulating: that he’s never felt more behind as a programmer. He explains where it came from. For most of last year, agentic tools were good at chunks — write a function, occasionally fix it. Useful, not transformative. Then over the December break he had time to push harder.

“The chunks just came out fine and then I kept asking for more and it just came out fine and then I can’t remember the last time I corrected it and then I just trusted the system more and more and then I was vibe coding.”

That’s the threshold. Not raw model quality — coherence over longer agentic loops. He stresses that anyone whose mental model of AI is still “ChatGPT-adjacent thing from last year” needs to look again as of December, because the regime changed.

Software 1.0, 2.0, 3.0

His framing has hardened into a useful three-paradigm progression:

1.0: humans write explicit code that the computer executes.
2.0: humans curate datasets and architectures; the weights are the “code.”
3.0: humans write context windows; the LLM is the interpreter.
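To make the three paradigms concrete, here is a minimal sketch of one toy task (sentiment classification) written three ways. Everything in it is an illustrative assumption rather than something from the talk: the word lists, the counting "trainer," and especially the `llm` callable, which stands in for whatever model client you actually use.

```python
from collections import Counter
from typing import Callable

# Software 1.0: a human writes the rule explicitly.
POSITIVE_WORDS = {"great", "love", "good"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}

def classify_1_0(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    return "positive" if score >= 0 else "negative"

# Software 2.0: a human curates labeled data; the "program" is the learned weights.
def train_2_0(examples: list[tuple[str, str]]) -> Counter:
    weights: Counter = Counter()
    for text, label in examples:
        delta = 1 if label == "positive" else -1
        for word in text.lower().split():
            weights[word] += delta
    return weights

def classify_2_0(weights: Counter, text: str) -> str:
    score = sum(weights[w] for w in text.lower().split())
    return "positive" if score >= 0 else "negative"

# Software 3.0: a human writes the context; the model interprets it.
def build_context(text: str) -> str:
    return (
        "Classify the sentiment of the review as 'positive' or 'negative'.\n"
        f"Review: {text}\nAnswer:"
    )

def classify_3_0(llm: Callable[[str], str], text: str) -> str:
    # `llm` is a hypothetical stand-in: any function from prompt text to reply text.
    return llm(build_context(text)).strip().lower()
```

In the 3.0 version the only thing the human authors is the string returned by `build_context`; swapping the task means editing prose, not code.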

The example that lands hardest is Open Code’s installer. Normally you’d ship a bash script that balloons as it tries to handle every OS and shell. Open Code instead ships a paragraph of text you paste into your agent. The agent reads your machine, figures out the steps, and debugs in the loop. The “installer” is now a prompt.

A second, more vivid example is “menu gen,” his side project for photographing a foreign restaurant menu and rendering pictures of each dish. He vibe-coded a real app: Vercel deploy, OCR, image gen, a UI to display results. Then he saw the Software 3.0 version: hand the photo to Gemini and tell it to use Nano Banana to overlay illustrations directly onto the menu image. One prompt. No app. Output is pixels.

“All of my menu gen is spurious. It’s working in the old paradigm. That app shouldn’t exist.”

The bigger point: don’t think of LLMs as making existing software faster. Think of them as making information processing possible where no app could have existed before. His other example is “LLM knowledge bases” — feed in your documents and the model recompiles them into a personal wiki. There was no Software 1.0 program that did this. It only exists in 3.0.

Verifiability and jagged intelligence

His piece on verifiability is the interpretive lens for why these models behave the way they do. Frontier labs train via giant RL environments with verification rewards. Math and code have clean reward signals, so models become superhuman there. Fuzzy domains stagnate. The result is a jagged frontier — peaks and valleys, no manual.
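A minimal sketch of what a "clean reward signal" means for code: the reward is just the fraction of test cases a candidate program passes, so it can be computed with no human judgment. The function name `verifiable_reward`, the `solve` entry point, and the test-case format are all assumptions made for illustration; this is not a description of how any lab's pipeline is built.

```python
def verifiable_reward(candidate_source: str, tests: list, fn_name: str = "solve") -> float:
    """Return the fraction of (args, expected) test cases the candidate passes.

    Assumes `tests` is non-empty. Running untrusted source with exec() is only
    acceptable inside a sandbox; it is used here to keep the sketch short.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)
        fn = namespace[fn_name]
    except Exception:
        return 0.0  # code that does not even load earns nothing

    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this case counts as a failure
    return passed / len(tests)


tests = [((2, 3), 5), ((0, 0), 0)]
good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a * b\n"
print(verifiable_reward(good, tests))
print(verifiable_reward(bad, tests))
```

The contrast with a fuzzy domain is that nothing like `tests` exists there: "is this essay good?" has no checker you can run a million times, which is the gap the paragraph above points at.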

His favorite jaggedness anecdote, updated for 2026:

“How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a 100,000 line codebase or find zero-day vulnerabilities, and yet tells me to walk to a car wash that’s 50 meters away to wash my car?”

The shape of capability is partly about what’s verifiable, but also about what the labs decided to care about. Chess capability jumped from GPT-3.5 to GPT-4 not because of general progress but because someone at OpenAI decided to dump chess data into pre-training. You’re at the mercy of the labs’ data mixture. If your application sits in a “circuit” that got RL’d, you fly. Otherwise you’re pulling teeth, and you should probably be fine-tuning.

His advice to founders: verifiability is a lever you can pull. If you can construct RL environments around your domain, you can fine-tune your way to capability the labs aren’t bothering to provide. (He hints there’s “one domain” he won’t name on stage that he thinks is wide open.)
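As a rough sketch of what "constructing an RL environment around your domain" can look like at its smallest, here is a toy environment with a `reset`/`step` interface and a reward that is checked mechanically. The domain (rewriting dates in ISO 8601), the class name `DateNormalizationEnv`, and the interface are all invented for illustration; a real environment would be richer and would plug into an actual training loop.

```python
import random
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

class DateNormalizationEnv:
    """Hypothetical single-turn environment: one prompt in, one answer out."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self._target = None

    def reset(self) -> str:
        # Sample a task whose correct answer we can compute ourselves.
        self._target = date(
            self._rng.randint(1990, 2030),
            self._rng.randint(1, 12),
            self._rng.randint(1, 28),
        )
        t = self._target
        return f"Write this date in ISO 8601 format: {MONTHS[t.month - 1]} {t.day}, {t.year}"

    def step(self, action: str) -> float:
        # The reward is verifiable: exact match against the known answer.
        if self._target is None:
            raise RuntimeError("call reset() before step()")
        return 1.0 if action.strip() == self._target.isoformat() else 0.0
```

The design point is that the environment generates tasks and grades answers without a human in the loop, which is what makes a domain trainable in the sense described above.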

Vibe coding vs. agentic engineering

This is the cleanest definition he’s offered.

“Vibe coding is about raising the floor for everyone in terms of what they can do in software. Agentic engineering is about preserving the quality bar of what existed before in professional software.”

Both true at once. The hobbyist gets superpowers; the professional inherits a new discipline. You’re still responsible for security, correctness, architecture — but you’re routing the work through stochastic, spiky agents, and the discipline is in coordinating them without sacrificing quality. He thinks the upper end of this skill is way past the old “10x engineer” — the variance is widening.

He also thinks hiring hasn’t caught up. Stop giving leetcode puzzles. Hand a candidate a real, big project (“build a Twitter clone for agents, make it secure”), then have ten Codex instances try to break it. Watch how the candidate coordinates, instruments, and defends.

What stays human

When asked what skill becomes more valuable as agents do more, his answer is taste, spec, oversight.

His menu gen war story: signup was via Google account, but credit purchases went through Stripe. The agent decided to associate Stripe payments with Google accounts by matching email addresses. If the two emails differ, nothing matches and the purchased credits are lost. A human would never design that; there’s an obvious user-ID layer missing. The agent had no taste. It just stitched the pieces together with the most semantically convenient string.
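The failure is easy to reproduce in miniature. The sketch below contrasts the email join with a join on an internal user ID that is carried through checkout. All types and function names here (`User`, `Payment`, `credit_by_email`, `credit_by_user_id`) are hypothetical; this is not menu gen's code and it does not call the Stripe or Google APIs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class User:
    user_id: str        # internal, stable identifier
    google_email: str   # email on the sign-in account

@dataclass(frozen=True)
class Payment:
    payer_email: str               # email typed at checkout; may differ from google_email
    credits: int
    user_id: Optional[str] = None  # only present if the app attached it at checkout

def credit_by_email(users: list, payment: Payment) -> Optional[str]:
    """Fragile design: find the account by matching email strings."""
    for user in users:
        if user.google_email.lower() == payment.payer_email.lower():
            return user.user_id
    return None  # payment succeeded, but nobody receives the credits

def credit_by_user_id(users: list, payment: Payment) -> Optional[str]:
    """Sturdier design: trust only the internal ID carried with the payment."""
    known_ids = {user.user_id for user in users}
    return payment.user_id if payment.user_id in known_ids else None


users = [User(user_id="u_1", google_email="someone@gmail.com")]
payment = Payment(payer_email="someone@work.example", credits=100, user_id="u_1")
print(credit_by_email(users, payment))    # None: the emails differ
print(credit_by_user_id(users, payment))  # u_1
```

The spec-level decision ("every payment carries our own user ID") is the part the human has to own; once it is written down, either implementation is a few lines.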

“You have to work with your agent to design a spec that is very detailed, basically the docs, and then get the agents to write them. You’re in charge of oversight and the top-level categories.”

He’s let go of caring whether it’s dim or axis, reshape or permute, keepdim or keepdims. The intern handles that. But you still need to know there’s an underlying tensor with a view and storage, because if you don’t, you’ll copy memory unnecessarily and the agent won’t catch it.
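The view/storage distinction can be shown without any tensor library. This sketch uses Python's built-in `bytearray` and `memoryview` as a stand-in, on the assumption that the same idea carries over to tensors: a view shares the underlying buffer, a copy does not.

```python
storage = bytearray(range(8))      # the underlying buffer, holding 0..7

view = memoryview(storage)[2:6]    # a view: no bytes are copied
copied = bytes(storage[2:6])       # a copy: new, independent storage

storage[2] = 255                   # mutate the original buffer

print(view[0])    # 255: the view sees the change
print(copied[0])  # 2: the copy does not
```

In a tensor library the same question decides whether an operation is free or allocates a second buffer the size of your data, which is the kind of cost an agent will not flag unless you know to ask.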

When asked whether taste matters less over time, he’s hopeful but skeptical. Right now models produce code that gives him “a little bit of a heart attack” — bloated, copy-pasted, awkward abstractions. He tried to get models to simplify nanoGPT-style code. They can’t. Simplicity isn’t in the RL reward. He thinks it could be added; it just hasn’t been.

Animals vs. ghosts

His “we are not building animals, we are summoning ghosts” framing comes up. He’s careful about claiming it has practical power — calls it “a little bit of philosophizing.” But the pragmatic takeaway: stop using your animal-intelligence intuitions on these things. Yelling at them doesn’t help. They’re statistical simulation circuits with RL appendages. Be suspicious. Probe.

Agent-native infrastructure

His pet peeve: documentation written for humans.

“Why are people still telling me what to do? I don’t want to do anything. What is the thing I should copy paste to my agent?”

He thinks the next wave is rewriting infrastructure to be sensors and actuators for agents — APIs, docs, deployment flows that an agent can read, plan against, and execute end-to-end. His benchmark: he should be able to prompt an LLM to “build menu gen and deploy it” and never touch a Vercel dashboard or DNS setting. We’re not there yet. The blocker isn’t model capability — it’s that the world’s surface area is still shaped for humans.
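One way to picture "docs an agent can read, plan against, and execute" is a runbook where every step carries both a command and a check. The sketch below is purely hypothetical: the `Step` structure, the `example-cli` commands, and the JSON layout are invented for illustration and do not describe any real tool or standard.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Step:
    name: str
    command: str   # what the agent should run
    verify: str    # how the agent can confirm the step worked

# Hypothetical deployment runbook; `example-cli` is a placeholder, not a real tool.
DEPLOY_RUNBOOK = [
    Step("install", "example-cli install", "example-cli --version"),
    Step("configure", "example-cli init --project demo", "example-cli status"),
    Step("deploy", "example-cli deploy", "example-cli health-check"),
]

def to_agent_doc(steps: list) -> str:
    """Serialize the runbook as JSON that can be pasted straight into an agent's context."""
    return json.dumps({"steps": [asdict(step) for step in steps]}, indent=2)

print(to_agent_doc(DEPLOY_RUNBOOK))
```

The `verify` field is the design choice worth noticing: human docs usually say what to do, while an agent also needs a machine-checkable way to tell whether it worked.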

Long term, he expects “agent representation” — your agent talks to my agent to schedule the meeting. Standard cocktail-party stuff at this point, but he says it offhand and clearly believes it.

Education and understanding

The closer is the line he says he keeps thinking about every other day:

“You can outsource your thinking but you can’t outsource your understanding.”

The whole point of his LLM knowledge base project is that he reads articles, dumps them into a personal wiki, and uses prompts to generate “different projections” onto the same information. That projection step is what makes it stick in his brain. He sees himself becoming the bottleneck — not because he can’t generate code anymore, but because directing agents requires knowing what’s worth building and why. Understanding is the part that doesn’t compress.

Key Takeaways

December 2025 was the agentic coding phase change. Not raw IQ — coherence over long loops. If your mental model is “ChatGPT plus autocomplete,” update it.
Software 3.0 = context as program. The Open Code installer (a paragraph of prompt) and Nano Banana menu rendering (no app, just pixels in, pixels out) are the canonical demos.
Jagged intelligence isn’t a bug, it’s the shape. Capability tracks what’s verifiable AND what the labs put in the mixture. Chess jumped because someone added chess data, not because of general progress.
Vibe coding raises the floor; agentic engineering protects the ceiling. Two different jobs. Both real.
The new 10x engineer is much more than 10x. Variance is widening. Hiring hasn’t adapted.
Taste = spec design + oversight. Stop micromanaging API details (intern handles those). Care about user-ID architecture, what’s worth building, what would be a security disaster.
Models hate simplicity. Karpathy can’t get them to simplify nanoGPT — simplicity isn’t in the RL reward. If you want minimalism, you’re outside the circuits and pulling teeth.
Agent-native infrastructure is mostly unbuilt. Docs, deployment, settings panels — all still shaped for humans. The deploy step is the bottleneck, not the code.
You can outsource thinking, not understanding. That’s the load-bearing line.

Claude’s Take

Karpathy is reliably substantive. The taxonomy here (Software 1.0/2.0/3.0, vibe coding vs. agentic engineering, jaggedness as a function of verifiability × lab attention, animals vs. ghosts) is genuinely his contribution to the field’s vocabulary, and most of it has been worked out across his X posts and prior talks. The Sequoia conversation doesn’t break new conceptual ground. What it does is consolidate. If you’ve been reading him on X, maybe 70% of this is familiar territory, freshly named.

The new bits worth paying for: the December flip story (specific, dated, falsifiable), the menu gen Stripe/Google email anecdote (concrete failure mode of agentic taste), the “models hate simplicity” observation about nanoGPT (this one stuck with me — it’s a real claim about the shape of RL reward), and the line about outsourcing thinking but not understanding (a tweet he didn’t write, but recognizing the right thing to repeat is a skill).

What’s missing: he gestures at “one domain that’s a wide-open verifiable RL opportunity” and then deliberately doesn’t say it. Frustrating but fair — he’s running a startup. Also: zero hard claims about timelines for the “decade of agents” thesis. He’s clearly thinking about it, but kept his predictions soft.

Score: 8/10. Karpathy at his usual level of clarity, slightly conservative on novelty because the ideas are mostly his older bets compressed into 30 minutes. If you’ve never read him on X, this is a 9. If you have, it’s a 7. Calling it 8.