Why We Switched From Claude Code to Codex

ELI5 / TLDR

Dan Shipper and Austin (Every’s head of growth) used to live inside Claude Code. Then OpenAI quietly pivoted Codex from a prickly senior-engineer tool into a polished general-purpose agent desktop app, and around GPT-5.5 it crossed the line into being their daily driver. The pitch is not really “Codex’s model is smarter than Claude’s” — it is “the Codex app is meaningfully faster and better organised, and the model is now close enough that the app difference decides it.” Their bigger point: a coding agent on your machine is no longer just for code. It is becoming the operating system for knowledge work — automating Slack, Gmail, Notion, Stripe, hiring searches, KPI dashboards, go-to-market plans.

The Full Story

The setup: model parity, app divergence

Dan describes the Codex of six months ago as “trash,” and not in a friendly way. He says OpenAI built it for senior engineers doing pair programming, which meant it argued with you, made you feel stupid, and had “no emotional intelligence.” Meanwhile Anthropic figured out something else: a model that is fast, smart, and emotionally tolerable, with access to your computer, is wildly useful for programmers and for anyone else doing knowledge work. Claude Code ate the territory.

Then OpenAI did a hard pivot. Codex stopped being a CLI for grumpy engineers and became a desktop app aimed at the same surface Anthropic discovered — the “agent management interface.” With GPT-5.5, Dan and Austin say the underlying models are now at rough parity for knowledge work. Opus is still better at design. Codex has a few things it does better. The deciding factor is the wrapper.

“There’s no comparison for how fast and powerful the Codex desktop app is as just like an app compared to the Claude desktop app.”

Austin’s claim is that the Codex app is roughly 30–40% better as software — faster, better organised, sub-agents that actually work, automations that stick. He frames it as the kind of difference that is hard to overcome with model improvements alone, at least until Anthropic ships a comparable app update. Dan agrees this will be a horse race for a while — every couple of months one side will pull ahead — but for now Codex’s surface is what won him over.

What “agent first” actually looks like in a day

The interesting bit is what Austin does inside Codex. He opens it first thing every morning. It is already wired into Gmail, Slack, Notion, Stripe, and Every’s internal tools. He talks to it the way most of us talk to a chat window, except the requests are concrete chores: make a run-of-show for today’s event, look at the calendar of upcoming launches, write the go-to-market plan from the existing meeting transcripts and Slack threads, ship a PR to the company’s product.

His mental model is two kinds of agents. There are “dumb” agents — automations that do the same thing every day at the same time, like an end-of-day pass over unanswered messages that drafts replies in Slack and Gmail for him to thumbs-up. And there are “smart” agents — the ones you brainstorm with, the strategic-partner kind. Codex builds both, and the dumb ones turn out to need surprisingly little tweaking to be useful.
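
A “dumb” agent of this kind can be pictured as a small scheduled job. The sketch below is only an illustration of the pattern described above, not how Codex or Austin's setup actually works: the Message and Draft types, the draft_reply placeholder, and the sample inbox are all hypothetical, and a real automation would read from Slack or Gmail and ask a model to write the reply text. The one design point it tries to capture is that the job only produces drafts marked for review and never sends anything itself.

```python
from dataclasses import dataclass


@dataclass
class Message:
    source: str    # hypothetical channel name, e.g. "slack" or "gmail"
    sender: str
    text: str
    answered: bool


@dataclass
class Draft:
    source: str
    to: str
    body: str
    status: str = "needs_review"  # nothing goes out without a human thumbs-up


def draft_reply(message):
    # Placeholder: a real automation would ask a model to write this text.
    return f"Hi {message.sender}, thanks for your note. [draft reply to: {message.text!r}]"


def end_of_day_pass(inbox):
    """Draft a reply for every unanswered message; never send anything."""
    return [
        Draft(source=m.source, to=m.sender, body=draft_reply(m))
        for m in inbox
        if not m.answered
    ]


# Hypothetical sample data for illustration only.
inbox = [
    Message("slack", "Priya", "Can you review the launch doc?", answered=False),
    Message("gmail", "Sam", "Invoice attached.", answered=True),
]
drafts = end_of_day_pass(inbox)
for d in drafts:
    print(d.source, d.to, d.status)
```

Run on a fixed schedule, a job like this does the same thing every day; the judgment stays with the person who approves or edits each draft.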

A small detail worth keeping. Austin does the drafting inside Codex but moves the final review step into the external app — Slack’s draft tab, Gmail’s drafts, a Notion doc. Not because Codex can’t show the draft inline, but because switching surfaces freshens his eyes before something goes out to a human. It is a small workflow rule that protects against the slop-by-default problem.

The go-to-market plan example

The clearest demo is when Austin gets squeezed between meetings and still needs to ship a launch plan. Normally this would mean blocking off a full day or staying up late. Instead, he points Codex at the meeting transcripts in Notion, the Slack threads, his go-to-market template, and the upcoming calendar of posts. He runs a “compound engineering brainstorm” step — a workflow that asks clarifying questions, then drafts — and tells it to ship the plan to a markdown doc.

His verdict: 80–90% of the way there. The model is not inventing strategy. It is consolidating thinking the team has already done across scattered surfaces and putting it in one structured place. That distinction matters. He is comfortable with this because the agent is acting as a librarian and a drafter, not as the actual strategist.

“I don’t make this plan for humans. I make this plan for humans and agents and primarily for humans to understand through agents.”

This is the part worth dwelling on. Austin writes the plan in a format where teammates can either read it directly or ask their own agent to summarise the business case, pull the pricing section, answer questions against it. Dan calls this “normalise sending agent documents around” — the cultural shift away from making AI write in your voice and toward just letting AI write, on the condition that you stand behind every line of it. If someone questions a bullet point and you say “oh, I didn’t know that was in there,” you are exposed.

The recruiting party trick

Dan tells one story that lands harder than the others. Every was hiring a head of learning and development. Dan had a hunch good candidates would have come through General Assembly, the New York programming-bootcamp company that was strong in the early 2010s. He asked Codex to pull a list of GA alums and filter for people who later moved into AI. It produced a list. The first name on it followed Dan on Twitter, looked perfect, and Dan just DM’d him. Maybe nothing comes of it, but the search itself — needle in a haystack, defined by a fuzzy hypothesis — used to be a multi-day chore.

The KPI sheet problem (and a useful confession)

Not everything just works. Austin has been rebuilding Every’s KPI tracker in Notion so that every agent in the company can read the same source of truth. He tried to one-shot it. The numbers came out 5–10% off because of formatting, framing, and edge cases in how MRR is calculated. For a business KPI dashboard, even 3% off is unusable.

What he is doing instead is going column by column, end to end, verifying each one is exactly right. It feels stupid to him, he says, because he has been spoiled by how powerful the models feel. But this is the honest texture of using these tools — they get you most of the way there fast, and then you have to do the boring verification work yourself when the cost of being wrong is real.
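
The column-by-column pass can be sketched as a simple comparison between what the agent produced and figures you already trust. This is a minimal illustration under assumptions: the column names, the sample numbers, and the check_column helper are invented for the example, and the source does not say how Austin actually runs his checks.

```python
def check_column(name, agent_values, trusted_values, tolerance=0.0):
    """Return (row_index, agent_value, trusted_value) for each mismatch in one column."""
    if len(agent_values) != len(trusted_values):
        raise ValueError(f"{name}: row counts differ")
    mismatches = []
    for i, (a, t) in enumerate(zip(agent_values, trusted_values)):
        if abs(a - t) > tolerance:
            mismatches.append((i, a, t))
    return mismatches


# Hypothetical data: the agent-built sheet versus hand-verified figures.
agent_sheet = {"mrr": [1000.0, 1050.0, 1100.0], "subscribers": [50, 52, 55]}
trusted = {"mrr": [1000.0, 1080.0, 1100.0], "subscribers": [50, 52, 55]}

# Check every column independently so one bad column cannot hide behind the others.
report = {col: check_column(col, agent_sheet[col], trusted[col]) for col in trusted}
for col, bad in report.items():
    print(col, "OK" if not bad else f"{len(bad)} mismatch(es): {bad}")
```

The default tolerance of zero reflects the point made above: for a KPI sheet, a small percentage error is still an error.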

“It turns out that figuring out how much money you’re making and how much you’ve grown is truly a philosophical question.”

Dan riffs on this. There is no single correct way to measure MRR. You just have to pick a definition and apply it consistently. The model cannot resolve that for you because it is not a technical question. This is a useful guard against the “AI will do everything” hype — there is still a chunk of knowledge work that is decision-shaped, and the agent cannot make the decision for you.
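
To see why this is a definitional question rather than a technical one, consider two plausible ways to compute MRR from the same subscriptions. The sketch below is an assumption-laden toy: the Subscription fields, the trial flag, and both definitions are invented for illustration and are not Every's actual method. Both functions are correct code; they simply answer different questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    plan_price: float       # price charged per billing period, in dollars
    period_months: int      # 1 for monthly, 12 for annual
    in_trial: bool = False  # hypothetical flag: customer has not paid yet


def mrr_normalized(subs):
    """Definition A: spread each plan over its billing period, trials included."""
    return sum(s.plan_price / s.period_months for s in subs)


def mrr_paying_only(subs):
    """Definition B: same normalization, but unpaid trials are excluded."""
    return sum(s.plan_price / s.period_months for s in subs if not s.in_trial)


# Hypothetical sample data.
subs = [
    Subscription(plan_price=20.0, period_months=1),
    Subscription(plan_price=240.0, period_months=12),
    Subscription(plan_price=20.0, period_months=1, in_trial=True),
]
print(mrr_normalized(subs))   # 60.0
print(mrr_paying_only(subs))  # 40.0
```

Neither number is wrong. The work is in choosing one definition and applying it to every row, every month, which is the decision the agent cannot make for you.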

Why people resist switching

Austin has been telling friends in New York to try Codex. Their faces fall. They are deep into Claude Code or Claude Desktop, and switching feels like learning a new operating system. He thinks the resistance is emotional — the cost of learning a new tool is real, even when the new tool is 30–40% better. Dan agrees. He thinks the right move is to bounce between them periodically, because the race is far from over and the wider point — that an agent is now the primary surface for knowledge work — applies regardless of which tool wins this month.

Compound engineering, slightly tweaked

A loose thread worth pulling. Both Dan and Austin use Kieran Klaassen’s “compound engineering” plugin — a workflow that adds plan, brainstorm, and review steps around any task. Austin found the built-in reviewers (security, front-end design) were too engineering-specific for marketing work, so he forked a version called “compound knowledge” that swaps in reviewers for strategic alignment and data accuracy. He says forking it taught him more than just using it. The plugin is public on Every’s GitHub.

The think-week answer

Someone asks how a team finds time to play with all this when they are slammed with day-to-day work. Dan’s answer is honest: it has to be cultural. Every does a “think week” twice a year where nobody does their normal job — they just play with new tools. He suggests teams who can’t go that far do a quarterly day of the same thing. His underlying argument is that the tools are moving fast enough that doing your existing job at maximum speed is no longer a winning strategy. Someone with a worse work ethic and a better workflow will beat you.

Key Takeaways

Codex’s model and Claude’s model are roughly at parity for knowledge work; the Codex desktop app is meaningfully better software (speed, organisation, sub-agents, automation setup).
Anthropic will probably catch up — this is a horse race for the next year — but the bigger point is the surface itself: a coding agent on your machine is becoming the OS for knowledge work.
Two kinds of agents are worth building: dumb ones (run on a schedule, do the same thing reliably) and smart ones (creative partners you brainstorm with). Codex builds both.
Move the final human-review step out of the agent app and back into the external tool (Slack, Gmail, Notion). It freshens your eyes before slop ships.
“Normalise sending agent documents around” — but only on the condition that you stand behind every line. Otherwise you get exposed in the next meeting.
Models still cannot resolve definitional questions for you (how do you measure MRR?). For anything where 3% wrong is unacceptable, you go column by column.
For organisations: build in deliberate play time. The tools are changing faster than your existing job description can keep up.

Claude’s Take

Honest read: this is half useful and half infomercial. The “we switched” framing is dramatic — Every is a media company that lives off being early to AI tools, and there is an obvious incentive to keep producing “we changed our minds” content. OpenAI also gave the audience free credits during the call, which is fine but worth noting. Discount the breathless bits accordingly.

What survives the discount is genuinely interesting. The claim that the app wrapper, not the model, is now the bottleneck rings true for anyone who has used both. The framing of dumb vs smart agents is clean and portable. The KPI dashboard story is the most honest moment in the conversation: these tools get you 80% there, and then you have to do the boring 20% by hand if the cost of being wrong is real.

The part worth taking seriously is the workflow rule about moving the final review out of the agent surface. That is a small operational insight that protects against a real failure mode — the homogenisation of everything that touches AI. If you read a sentence inside the tool that wrote it, you are already half-convinced. Switching surfaces breaks the spell.

What is missing is any honest accounting of where the tools fail. Privacy, security, what happens when the agent silently sends the wrong thing to the wrong person, what the actual error rate is for “automations that just work.” Austin says they “just work,” but also admits the KPI sheet was 5–10% off on a one-shot. Both can be true; the talk does not reconcile them.

Score: 6/10. Useful for the workflow patterns and the dumb-vs-smart framing. Skim-able for the rest.