Context Is the New Code — Patrick Debois, Tessl

ELI5/TLDR

Patrick Debois coined the word “DevOps” back in 2009. His thesis now: when you stop writing code by hand and instead steer an AI coding agent, the real artifact you’re producing isn’t code — it’s context (prompts, agent.md files, skills, docs you pulled in). And if context is the new code, it deserves the same lifecycle code got: generate it, test it, package it, distribute it, observe it in production. He calls this the Context Development Lifecycle. The talk is mostly an unpolished sketch of what each loop step looks like.

The Full Story

The shift Debois is naming

The premise lands in the first ninety seconds. Debois barely touches code anymore — he prompts. The pieces he used to write as helper functions, he now writes as skills: a markdown file that tells the agent “first figure out what package manager they use, then figure out their ecosystem, then walk through these steps with the user.”

Code is also transforming back into context as a skill, as a workflow that’s reusable.

This is the move. Twenty years of programming wisdom went into making code testable, versioned, reviewed, packaged, scanned. Context is currently treated like a Post-it note. Debois thinks the same machinery — CI, registries, linters, observability — has to be rebuilt around context, and most of the talk is him sketching what each piece looks like.

The infinity loop

He draws a familiar DevOps shape: Generate → Test → Distribute → Observe → (back to Generate). The talk walks the loop.

Generate. Five sources of context, ordered roughly by sophistication:

Plain prompting — you typing in the chat box.
Reusable instructions — agent.md files (he good-naturedly bashes Anthropic for sticking with Claude.md).
Pulled-in library docs, because LLMs hallucinate version 2 vs version 3 APIs.
MCP-style pulls from GitLab, Slack, Jira tickets — anything the agent can fetch.
Spec-driven development — you write a specification, the agent breaks it into a plan, the plan becomes a sequence of prompts.

Test. This is the section he leans hardest on. If you change two lines of your Claude.md, do you actually know the impact? Most people YOLO it. The AI engineering crowd has been writing evals for years; coders haven’t. He maps testing onto familiar tiers:

Linting for context — does the skill description fit the spec, is it under the length limit, is the syntax right.
Grammarly for context — ask another LLM “given this skill, do you actually understand what it says?” If the agent can’t parse your two-word instruction, neither can the runtime agent. (He plugs voice coding here — apparently he writes much more elaborate context when he’s talking than when he’s typing with two fingers.)
Unit tests — given a Claude.md that says “every API endpoint must be prefixed with /awesome”, you ask the user to “add an endpoint to save a user”, then ask a second LLM to judge whether the generated code actually starts with /awesome. Without your context, no model would ever do that. So the test is really probing whether your context is doing the work.
End-to-end tests — give the judge LLM tools (a sandbox, a curl), and now it doesn’t just read the generated code, it runs the endpoint.

The catch nobody trained on regular CI is ready for: evals are non-deterministic. Run the same test five times, you get four passes and one fail. So you stop thinking pass/fail and start thinking error budgets — borrowed from SRE. This set of tests gets a 5% budget; that one gets 0%. Your CI pipeline now has to run each eval N times and reason about success rates.

Distribute. Once your context is good, share it. The naive way is checking it into the repo. The grown-up way is packaging: a skill bundles markdown + scripts + sub-docs as a unit, and you publish to a registry — Anthropic’s skills marketplace, Tessl’s registry, your company’s internal one. Then comes the part you can already see coming:

99.9, and I mean that in a very sincere way, of the skills is crap.

He doesn’t moralise. He just notes that if you ran any reasonable eval suite on the public registries, almost nothing would pass. Which is fine — you learn from looking — but it means most teams will end up running their own internal registry of vetted context. And once you have packages, you get dependency hell: the React context-package conflicts with your front-end skill. And then you need security: something like Snyk-for-context that scans for prompt injections, leaked credentials, third-party calls. And then an AI SBOM — who built this skill, with what model, when.

Observe. This is the part Debois finds underexplored. Once your skill is out in the wild, how do you know it’s working?

Three feedback channels:

Agent logs — every modern agent emits structured logs. If everyone’s agent on a project keeps logging “missing context for X”, that’s a signal: write a skill for X.
PR comments — every “this isn’t quite right” comment on an AI-generated PR is feedback on the context, not the code. Don’t keep arguing on the PR; fix the context so next time it doesn’t happen.
Production telemetry — the code generated from context is the code running in prod. When it fails there, that failure should round-trip back into a new test case for the context that produced it.

He also flags a security wrinkle most sandbox stories miss. You can sandbox the agent’s execution, but the agent loads your agent.md and skill.md before the sandbox kicks in — there’s nothing filtering the context itself. He proposes a context filter — a web application firewall but for prompts, scanning for injections and bad patterns as the context is loaded.

The Tessl thesis, lightly

Tessl is Debois’s company; this is a soft pitch. The framing: LLMs are the engine, context is the fuel. You can’t tune the engine — you’re using whatever Anthropic or Google ships. But you can engineer the fuel. Tessl is building tooling for the lifecycle he just described — eval runners, registries, observability — so that “context engineering” becomes a discipline instead of vibes-plus-copy-paste.

He closes with a scaling argument: most people are still in the solo loop (I tweak my own markdown). The next step is the team loop (we share a registry, we eval together). The step after that is the team-of-teams flywheel — one team’s fix becomes another team’s freebie.

Key Takeaways

Treat your Claude.md / agent.md like source code. Version it, review it, test the impact of edits.
Write evals for context changes. Pick a behaviour your context is supposed to enforce (e.g. “endpoints prefixed with /awesome”), generate output with and without the context, have an LLM judge the diff. Suite this up.
Run evals N times, not once. Use error budgets per test, not pass/fail.
Promote repeated helper code into skills. If you keep writing the same wrapper, it’s a workflow — write it as a skill so the agent figures out the variations at runtime.
PR comments and prod failures are inputs to context. Stop arguing on the PR; fix the upstream markdown so the next generation doesn’t repeat the bug.
Don’t trust public skill registries blindly. Most published skills won’t pass a real eval suite. Either curate, or run your own internal registry.
Scan context for prompt injections at load time. Sandboxing the agent isn’t enough — the malicious skill is in the prompt before sandboxing engages.

Claude’s Take

The talk is unpolished and Debois says so up front. Slides admittedly dense, half-formed in places. But the framing — context has a lifecycle, and we should rebuild the DevOps stack around it — is the kind of thing that gets quoted for a year because it gives people language for what they’re already doing badly. “Eval your Claude.md change before merging it” is good, concrete advice that almost no team is doing today.

The honest weakness: the parallels to DevOps are very clean, almost suspiciously so. Linting, unit tests, end-to-end, CI, registries, SBOMs, observability — every box gets a context analogue, sometimes by squinting. Real life will be messier. The non-determinism point (run evals five times, use error budgets) is the place where the analogy actually breaks and he handles it well; the rest stays neat.

The Tessl pitch is mild and earned — he’s not selling, he’s pointing at the gap. And the closing one-liner — LLMs are the engine, context is the fuel — is the kind of thing that survives the talk. 8/10. Not a deep technical talk, but a good vocabulary delivery. Worth watching if you write more Claude.md than code these days.