Headroom: A Context Optimization Layer for LLM Applications - Tejas Chopra, Netflix, Inc.
ELI5/TLDR
A Netflix engineer kept burning through his AI-coding budget and couldn’t figure out where the money went. So he cracked open the pipe between his AI tool and the AI model, and found that most of what gets sent is garbage the model doesn’t need. Headroom is a small program that sits in the middle and quietly trims the fat before it reaches the model — squeezing out 20-30% of the cost, with the option for the model to ask for the original back if it ever needs it. It runs entirely on your laptop and has saved its users an estimated $700,000 so far.
The Full Story
The thing you pay for that you never see
When you chat with an AI coding tool, you imagine a tidy back-and-forth: you say something, it replies, you say the next thing. That is not what happens under the hood.
Every time you send even a single word, the AI tool re-sends the entire conversation so far — every previous message, every file it read, every command it ran — all bundled together and shipped to the model again. The model has no memory between calls. Think of it like a goldfish you have to re-brief from scratch every single time, except the briefing keeps getting longer.
“Every time you just say a hi after everything you’ve said till now, it’s everything that you’ve said till now plus that hi that goes to an LLM. As you can imagine, 99% of it is things that you’ve already sent.”
That growing bundle is the “context window” — the model’s short-term workspace, measured in tokens (roughly, chunks of text). You pay per token. So the bundle is the bill.
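To make the "bundle is the bill" point concrete, here is a minimal sketch of how billed input tokens accumulate when every call re-sends the whole transcript. The per-turn sizes are made-up numbers for illustration, and the function name is hypothetical; real tools also add model replies and tool outputs to the history, which this toy folds into each turn's size.

```python
def billed_input_tokens(turn_sizes):
    """Input tokens billed on each call when every call re-sends the full history."""
    billed = []
    history = 0
    for size in turn_sizes:
        history += size         # the transcript grows by this turn's new content
        billed.append(history)  # and the whole transcript is sent again
    return billed


# Hypothetical tokens added to the transcript on each turn.
turns = [1000, 200, 5000, 50]
per_call = billed_input_tokens(turns)

print(per_call)       # [1000, 1200, 6200, 6250]
print(sum(turns))     # 6250 tokens of genuinely new content
print(sum(per_call))  # 14650 tokens actually billed as input
```

Note how the 5,000-token turn (say, a log file read) is paid for again on every later call, including the final 50-token message.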
Where the waste hides
Tejas Chopra, the speaker, builds the data storage behind Netflix’s recommendations by day. Headroom is his nights-and-weekends project, born from a specific irritation: he kept running out of his daily token allowance on Claude Code and had no idea why.
When he looked, the answer was obvious and dumb. He’d once asked his coding tool to find why his CPU spiked, and the tool dutifully pulled an entire log file into the context window to read it.
“The entire log file was pulled into the context window. That is wasteful. Because 90% of it is waste and garbage that I don’t care about.”
Same story with database calls (you get a big blob of JSON, care about 20% of it), web pages, research papers, and so on. His key realization: most existing “token-saving” tools focus on compressing your prompt — the thing you type. But the prompt is the tiny part. The real bloat is everything else: file reads, tool outputs, web pages, API responses. And you can’t squeeze all of those the same way.
“I realized that one size fits all will not work here.”
The trick that makes it safe: reversible compression
The obvious worry with trimming data before it reaches the model is: what if you cut something the model actually needed? Headroom’s answer is a neat sleight of hand it calls “reversible compression.”
When it squashes a chunk of data, it doesn’t just delete the rest. It leaves behind a little marker — an ID — and tells the model, in effect: “I trimmed this. If you find you’re missing something, call this tool and I’ll hand you the original.” The full version sits in local storage on your laptop, waiting.
“If the LLM wants to get the original context back because you compressed too aggressively, it can make a tool call and fetch that. That is how it is reversible.”
This works because of how AI tools register their capabilities. When you give a model a “tool,” you also give it a plain-English note saying when to use it. Headroom registers a retrieve tool with the instruction: “if something seems compressed or missing, call me.” In practice, Chopra says the model asks for the original only about 1% of the time — usually it finds what it needs in the trimmed version. He even tested this with an old, small model (GPT-4o-mini) and it still reliably made the call.
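The marker-plus-retrieve-tool idea can be sketched in a few lines. This is an illustration under stated assumptions, not Headroom's code: the "compressor" here simply keeps the first few lines, an in-memory dict stands in for local storage, and the names `compress_reversibly`, `retrieve_original`, and `RETRIEVE_TOOL` are all hypothetical.

```python
import hashlib

# Stand-in for local storage of originals (assumption: a plain dict).
_ORIGINALS = {}


def compress_reversibly(text, keep_lines=3):
    """Trim text to its first lines and append a marker carrying a lookup ref."""
    lines = text.splitlines()
    if len(lines) <= keep_lines:
        return text  # nothing worth trimming
    ref = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    _ORIGINALS[ref] = text
    omitted = len(lines) - keep_lines
    marker = f"[compressed: {omitted} lines omitted, ref={ref}]"
    return "\n".join(lines[:keep_lines] + [marker])


def retrieve_original(ref):
    """What the retrieve tool would run when the model asks for the full text."""
    return _ORIGINALS[ref]


# Hypothetical tool definition: a name, a plain-English note on when to use it,
# and a JSON-schema-style description of its single argument.
RETRIEVE_TOOL = {
    "name": "retrieve_original",
    "description": (
        "If content seems compressed or missing, call this with the ref "
        "from the [compressed: ...] marker to get the original text."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"ref": {"type": "string"}},
        "required": ["ref"],
    },
}

log = "\n".join(f"line {i}" for i in range(10))
print(compress_reversibly(log))
```

The model sees three lines plus the marker; the other seven stay on disk unless it calls the tool with the ref.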
Different data, different scissors
Headroom doesn’t have one compressor — it has several, each tuned to a data type. This is the core insight.
- Code has structure. You can parse it into a tree (an AST, or abstract syntax tree — basically code’s grammatical skeleton) and strip away the parts that don’t matter, keeping the shape intact.
- JSON (the structured blobs APIs return) gets a “smart crusher” that looks at which fields the user’s question actually cares about, measures which values are outliers versus run-of-the-mill, and squashes the boring ones. Claimed savings: 83-95% in the best case.
- Web pages get a DOM compressor (the DOM is the page’s underlying structure).
- Plain text with no obvious structure gets the hard case. Chopra first tried Microsoft’s LLMLingua, an open-source text compressor, but it had been trained on meeting summaries — nothing like coding work — and underperformed. So he trained his own, called CompressBase.
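A rough sketch of "different scissors" as a dispatcher, assuming Python 3.9+ for `ast.unparse`. Both compressors are toys written for this illustration: the code one replaces every function body with `...`, and the JSON one keeps only top-level fields whose names appear in the user's question. Headroom's real compressors (outlier scoring, DOM handling, the trained text model) are not shown, and all function names here are hypothetical.

```python
import ast
import json


def skeletonize_python(source):
    """Keep signatures and structure; replace each function body with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Expr(value=ast.Constant(value=...))]
    return ast.unparse(tree)


def crush_json(payload, query):
    """Keep only top-level fields whose name is mentioned in the query."""
    record = json.loads(payload)  # assumes a top-level JSON object
    wanted = {k: v for k, v in record.items() if k.lower() in query.lower()}
    return json.dumps(wanted, sort_keys=True)


def compress(content, kind, query=""):
    """Route content to a compressor by data type; unknown types pass through."""
    if kind == "python":
        return skeletonize_python(content)
    if kind == "json":
        return crush_json(content, query)
    return content


code = "def add(a, b):\n    total = a + b\n    return total\n"
print(compress(code, "python"))

blob = '{"cpu": 97, "host": "a1", "uptime": 12}'
print(compress(blob, "json", query="why did cpu spike?"))  # {"cpu": 97}
```

The pass-through default matters: content with no matching compressor should reach the model untouched rather than be squeezed by the wrong tool.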
CompressBase is worth a beat. It is not a summarizer. It doesn’t rewrite or generate anything. It’s an “encoder-only” model whose entire job is to look at each token and vote: keep or cut.
“It’s not a summarizer… It’s just weighing the different tokens and deciding if the presence of a token or an absence of a token impacts the output or not.”
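The keep-or-cut step can be pictured as applying a mask to the token sequence. In the sketch below the per-token keep probabilities are invented by hand purely for illustration; in a real system they would come from the encoder model, which is not reproduced here. The function name and the 0.5 threshold are assumptions.

```python
def apply_keep_mask(tokens, keep_probs, threshold=0.5):
    """Drop every token whose keep probability falls below the threshold."""
    if len(tokens) != len(keep_probs):
        raise ValueError("need exactly one score per token")
    return [tok for tok, p in zip(tokens, keep_probs) if p >= threshold]


# Hand-written scores standing in for a model's output (hypothetical values).
tokens = ["ERROR", "at", "12:01", "the", "worker", "ran", "out", "of", "memory"]
scores = [0.98, 0.10, 0.85, 0.05, 0.90, 0.40, 0.70, 0.30, 0.97]

print(" ".join(apply_keep_mask(tokens, scores)))  # ERROR 12:01 worker out memory
```

Nothing is rewritten or generated: the output is always a subsequence of the input, which is the difference from a summarizer.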
The hidden game of cache pricing
A big chunk of the talk is about a billing mechanic the providers bury in their docs: prefix caching. Because you keep re-sending the same growing bundle, providers offer a deal: they cache the front of the bundle, and when the next request starts with data they’ve already seen, that portion is charged at about 10% of full price.
The catch: change even one character early in that bundle and the whole cache is invalidated. You pay full freight for everything.
“Even if you change a little bit within that entire window, it is not a cache hit. So we will penalize you for the entire window.”
Where does a sneaky change come from? Often the system prompt — the hidden instructions the provider injects — contains a date or a session ID that changes every time. That alone can blow up your bill. Headroom’s “cache aligner” hunts for these dynamic fields and shoves them to the end of the bundle, so the cacheable front portion stays stable and keeps earning the discount.
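A small sketch of why field order matters for prefix caching. It compares two requests that differ only in a date, once with the date at the front and once with it moved to the end, and measures how much of the leading text they share. This is an illustration only: real caches match on tokens rather than characters, the prompt strings are invented, and the function names are hypothetical rather than Headroom's.

```python
def shared_prefix_len(a, b):
    """Number of leading characters two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(static_parts, dynamic_parts, align):
    """Join prompt parts; with align=True the dynamic fields go last."""
    if align:
        ordered = static_parts + dynamic_parts
    else:
        ordered = dynamic_parts + static_parts
    return "\n".join(ordered)


# Invented system-prompt content for the example.
STATIC = ["You are a coding assistant.", "Tools: read_file, run_command."]


def reusable_prefix(align):
    """Shared prefix length between two requests sent on different days."""
    first = build_prompt(STATIC, ["Date: 2025-01-06"], align)
    second = build_prompt(STATIC, ["Date: 2025-01-07"], align)
    return shared_prefix_len(first, second)


print(reusable_prefix(align=False))  # only part of the date line is shared
print(reusable_prefix(align=True))   # the whole static section is shared
```

With the date first, the two requests diverge within the first line and nothing after it can be reused; with the date last, the entire static section stays cacheable.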
There’s more buried nuance. Claude’s cache discount, by default, only survives 5 minutes without a cache hit; each hit resets the clock. Leave a six-minute gap between requests and you pay full price for the whole bundle. And that 5-minute clock isn’t really yours to control, which is the part Chopra clearly got burned by:
“If Claude decides that it has to fork off a sub-agent… the sub-agent has its own prefix cache. So by the time the sub-agent comes back, you’ve already exhausted your 5 minutes.”
There’s a hidden 1-hour option too, but it costs double on writes to get 90% off reads. Which deal is better depends entirely on how you work. Headroom’s pitch: it studies your past sessions and sets the right option for you automatically. Every provider — Claude, Codex/OpenAI, Gemini, open models — exposes these knobs differently, and stitching them all together is genuinely fiddly. Headroom tries to be the one layer that hides the mess.
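Which option wins "depends entirely on how you work" can be shown with a toy cost model. The 2x write price for the 1-hour option and the 90%-off reads come from the talk; the 1.25x write price for the 5-minute option is an assumption based on Anthropic's published pricing and should be checked against current docs. The model also assumes a fixed-size prefix and that each cache hit resets the clock. All names are hypothetical.

```python
WRITE_5MIN = 1.25   # assumption: 5-minute cache writes cost 1.25x base input
WRITE_1HOUR = 2.0   # from the talk: 1-hour writes cost double
READ = 0.1          # from the talk: cached reads are 90% off


def session_cost(gaps_min, prefix_tokens, ttl_min, write_mult, read_mult=READ):
    """Cost, in base-price token units, of sending one fixed prefix per call.

    gaps_min lists the idle minutes before each call after the first.
    """
    cost = prefix_tokens * write_mult  # the first call always writes the cache
    for gap in gaps_min:
        if gap <= ttl_min:
            cost += prefix_tokens * read_mult   # cache hit, clock resets
        else:
            cost += prefix_tokens * write_mult  # expired, pay to write again
    return cost


bursty = [2, 8, 3, 12, 1]   # pauses for sub-agents or thinking (hypothetical)
steady = [1, 1, 1, 1, 1]    # rapid back-and-forth (hypothetical)

for name, gaps in [("bursty", bursty), ("steady", steady)]:
    five = session_cost(gaps, 10_000, ttl_min=5, write_mult=WRITE_5MIN)
    hour = session_cost(gaps, 10_000, ttl_min=60, write_mult=WRITE_1HOUR)
    print(f"{name}: 5-min={five:,.0f}  1-hour={hour:,.0f}")
# bursty: 5-min=40,500  1-hour=25,000
# steady: 5-min=17,500  1-hour=25,000
```

Under these assumptions the 1-hour option is cheaper for the session with long pauses and more expensive for the steady one, which is the trade-off a tool would have to learn from past sessions.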
How you actually use it, and what it claims
You install it with a simple pip install and either wrap your tool (headroom wrap claude) or call a compress command from your own pipelines. It starts a local proxy and routes everything through it. Nothing leaves your machine.
The headline numbers: typical users save 20-30% (it scales with how many tool calls you make). Across opt-in telemetry, users have saved an estimated 200 billion tokens — about $700,000. The project hit ~1,900 GitHub stars and 30+ contributors in four months, and is being rewritten from Python into Rust.
“That tells you that different providers are charging that much money for bloat.”
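As a sanity check on the headline figures, the two totals imply an average price per million tokens. The calculation below uses only the numbers quoted above; the function name is hypothetical, and the result is a blended average across whatever models and cache states the telemetry covered.

```python
def implied_price_per_million(dollars_saved, tokens_saved):
    """Average dollars per one million tokens implied by the two totals."""
    return dollars_saved / (tokens_saved / 1_000_000)


print(implied_price_per_million(700_000, 200_000_000_000))  # 3.5
```

About $3.50 per million tokens is in the range of list prices for frontier-model input, so the dollar estimate is at least consistent with the token count.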
It’s not just about money
Two side benefits Chopra stresses. Latency: one user runs Headroom inside a voice agent. Voice feels human only under ~200 milliseconds of delay; the agent’s round-trip was ~300ms. Trimming the data sent to the model shaves the gap. Accuracy: as context windows get longer, models actually get worse at finding the relevant needle. Less hay, better answers.
There’s also an image/video angle — one user records factory machines through smart glasses and uploads the video to Claude for operating instructions, at $3 per video. A Headroom variant that chops the video into pieces drops that to $0.20.
Where it’s going
Future directions: domain-specific compressors (financial data and medical data have very different rules about what’s safe to cut — you can’t just drop numbers from a balance sheet), and a sister project called Headlight focused on “provenance” — tracking exactly what went into a context window and where each piece came from. The reasoning: today’s observability dashboards are built for humans to read, but soon agents will be the ones reading telemetry, so it should be compact and machine-friendly. Because Headroom sits in the middle, it’s well-placed to stamp every token with its source.
Key Takeaways
- AI models have no memory between calls — every turn re-sends the entire conversation, so the bill grows with the transcript, not with what you just typed.
- The real token waste isn’t in your prompts; it’s in tool outputs, file reads, web pages, and JSON blobs the model pulls in. Most prior “compression” tools only target the prompt, which is the small part.
- “Reversible compression”: trim the data but leave a marker so the model can fetch the original via a tool call if needed. In practice the model only asks ~1% of the time.
- One compressor per data type is the core idea: AST parsing for code, field-importance scoring for JSON, DOM stripping for web pages, and a token keep/cut model for plain text.
- CompressBase is an encoder-only model — it doesn’t summarize or rewrite, it just scores each token as keep-or-cut. Trained on agentic coding traces, unlike Microsoft’s LLMLingua which was trained on meeting summaries.
- Prefix caching gives ~90% off re-sent data — but a single changed character (often a date or session ID in the hidden system prompt) invalidates the whole cache and restores full pricing.
- Claude’s cache discount defaults to a 5-minute window, and spawning a sub-agent can silently burn that window. A hidden 1-hour option exists but doubles write cost for 90%-off reads.
- The “cache aligner” moves dynamic fields (dates, UUIDs) to the end of the bundle so the stable front stays cacheable.
- Compressing tokens buys three things, not one: lower cost, lower latency (matters for sub-200ms voice agents), and often higher accuracy, since long context windows degrade model performance.
- Provider cache discounts differ: Anthropic gives ~90% with cache_control tags, OpenAI gives ~50% with no tag needed, and Google gives ~75% when its caching works at all.
- Headroom runs entirely local (Redis + SQLite, 5-minute TTL on stored originals); no user data leaves the machine, and telemetry is opt-in and limited to token-savings counts.
- Real-world stretch uses: a factory-video pipeline dropped from $3 to $0.20 per upload via an image variant; a 190-page 10-K financial filing saw 34% token reduction with answers intact.
Claude’s Take
This is a competent, honest engineering talk about a real and slightly embarrassing problem: the major AI coding tools ship enormous amounts of redundant data and the pricing around it is deliberately opaque. The cache-window mechanics Chopra describes — the 5-minute TTL, the sub-agent that quietly eats your discount, the date field that nukes your cache — are accurate and genuinely under-documented, and hearing them laid out plainly is the most valuable part of the talk.
The honesty is refreshing. He calls “reversible compression” a marketing term to its face, admits CompressBase “is not the best model,” and flat-out says “we don’t have a good story there” on model drift. That candor buys credibility. The architecture — a local proxy with pluggable, data-type-specific compressors and a tool-call escape hatch — is sensible and the right shape for the problem.
What to hold loosely: the savings numbers. “20-30%,” “200 billion tokens,” “$700,000 saved” all rest on opt-in telemetry from self-selected users who were already token-obsessed — a flattering sample. The accuracy evals are explicitly described as ongoing and partly anecdotal, which is exactly the thing that determines whether this is useful or quietly corrosive. Trimming context to save money is only smart if the model still gets the right answer, and “benchmarks look the same with and without” is a claim that needs more weight behind it than a conference slide can carry. The drift problem he’s honest about — models change weekly, your compressor was tuned against an older one — is a real maintenance tax that could undermine the whole thing over time.
Score: 7. Clear, useful, and mechanism-rich, with a presenter who doesn’t oversell. It loses points only because the core value claims are still under-evidenced and the project is young enough that the hard part — proving compression never quietly degrades answers — is admittedly unfinished.
Further Reading
- LLMLingua (Microsoft): the open-source prompt-compression project Chopra started with before training his own; good for understanding the baseline approach to text compression.
- Anthropic prompt caching docs: the cache_control tags, 5-minute vs 1-hour TTL, and pricing mechanics discussed at length.
- RTK / LeanCTX: tools that compress CLI command outputs (e.g. GitHub CLI) at the point of the call; the --compress flag family.
- Headroom on GitHub: the project itself, including the demos referenced in the talk.