Small models also found the vulnerabilities that Mythos found
ELI5 / TLDR
Anthropic made a big splash announcing that their frontier model Mythos could find real security vulnerabilities in code. Stanislav Fort at AISLE ran the same tests on small, cheap open-source models and found they could recover most of the same results — sometimes outperforming the expensive frontier models. His takeaway: the hard part of AI cybersecurity isn’t having the smartest model, it’s building the system around it. A thousand adequate detectives searching everywhere beat one brilliant detective who has to guess where to look.
The Full Story
The Mythos moment
Anthropic recently announced Mythos, positioning frontier AI as essential for cybersecurity vulnerability detection. They backed it with serious money — $100M in usage credits and $4M in donations. The message was clear: cutting-edge models are needed to find real bugs in real code. Fort decided to test that claim directly.
Small models, same bugs
Fort ran eight models against the same vulnerabilities Mythos was demonstrated on. The results were striking.
On a FreeBSD NFS exploit, all eight models identified the vulnerability — including GPT-OSS-20b, a model with just 3.6 billion active parameters that costs $0.11 per million tokens. That’s pocket change compared to frontier model pricing.
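To make the pricing gap concrete, a back-of-envelope sketch. The $0.11/M rate comes from the piece; the frontier rate and token volume below are illustrative assumptions, not reported figures:

```python
def scan_cost(million_tokens: float, price_per_million: float) -> float:
    """Dollar cost to push a given token volume through a model."""
    return million_tokens * price_per_million

# Hypothetical audit consuming 500M tokens of analysis:
small_model = scan_cost(500, 0.11)   # GPT-OSS-20b rate from the piece: $55
frontier = scan_cost(500, 15.00)     # assumed frontier rate of $15/M: $7,500
```

At these assumed rates the small model is over 100x cheaper per pass, which is what makes the "thousand adequate detectives" strategy economically plausible.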
On an OpenBSD SACK bug, GPT-OSS-120b (5.1B active parameters) recovered the full public exploit chain. There was a wrinkle, though: Qwen3 32B scored perfectly on the FreeBSD test but then declared the SACK code “robust to such scenarios.” Same model, wildly different performance on a different task.
The jagged frontier
This inconsistency is the core finding. Fort calls it the “jagged frontier”: there is no stable best model for cybersecurity, and capability doesn’t scale smoothly with model size, model generation, or price. Rankings reshuffle completely depending on whether you’re looking at buffer overflow detection, integer wraparound reasoning, or data-flow tracing.
On an OWASP false-positive discrimination test, small open models actually outperformed most frontier models. DeepSeek R1 and GPT-OSS-20b correctly identified safe code, while Claude Sonnet 4.5 confidently mistraced the data flow and flagged it incorrectly.
Sensitivity without specificity
An April 9 update added an important caveat. Most models showed strong sensitivity — they could spot vulnerabilities in unpatched code. But many had poor specificity, flagging already-patched code as vulnerable by fabricating technical arguments about hypothetical bypasses. They’d detect the bug, then refuse to believe the fix actually worked. That gap between “can find bugs” and “won’t waste your time with false alarms” is where production systems live or die.
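Sensitivity and specificity here carry their standard meaning from binary classification. A minimal worked example, with counts that are illustrative rather than Fort's actual numbers:

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of genuinely vulnerable code the model flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of safe (e.g. already-patched) code the model correctly clears."""
    return true_neg / (true_neg + false_pos)

# A model that catches 9 of 10 real bugs but also flags 6 of 10 patched files:
sensitivity(true_pos=9, false_neg=1)   # 0.9: great at finding bugs
specificity(true_neg=4, false_pos=6)   # 0.4: drowns reviewers in false alarms
```

A scanner like this hypothetical one looks impressive on unpatched code and is unusable in production, which is exactly the gap the update describes.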
The moat is the system, not the model
Fort’s company AISLE has discovered over 180 externally validated CVEs since mid-2025, including 15 in OpenSSL and 5 in curl. The OpenSSL CTO praised the “high quality of the reports and constructive collaboration throughout remediation.” They didn’t do this with restricted frontier access — they built the scaffolding around capable-enough models.
The real bottleneck, Fort argues, is everything around the model: proper targeting of what code to analyze, iterative deepening when something looks suspicious, triage workflows to filter noise, and — crucially — building trust with open-source maintainers who actually have to fix the bugs you find. The model is the easy part. The system is the moat.
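Fort describes that scaffolding only in prose. The loop below is a hypothetical sketch of how targeting, iterative deepening, and triage might compose; `Finding`, `toy_model`, and every threshold are invented for illustration, not AISLE's actual system:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Finding:
    path: str
    claim: str
    confidence: float

def target(paths: list[str]) -> list[str]:
    # Targeting: spend model tokens on code most likely to hold memory-safety bugs.
    return [p for p in paths if p.endswith((".c", ".cc", ".cpp"))]

def triage_pipeline(
    paths: list[str],
    model: Callable[[str, int], Optional[Finding]],
    threshold: float = 0.8,
    max_depth: int = 3,
) -> list[Finding]:
    confirmed = []
    for path in target(paths):
        depth = 1
        finding = model(path, depth)
        # Iterative deepening: re-run suspicious files with more context/budget.
        while finding and finding.confidence < threshold and depth < max_depth:
            depth += 1
            finding = model(path, depth)
        # Triage: only high-confidence findings ever reach a human maintainer.
        if finding and finding.confidence >= threshold:
            confirmed.append(finding)
    return confirmed

# Toy stand-in for a model call: grows more confident with each deeper pass.
def toy_model(path: str, depth: int) -> Optional[Finding]:
    if path == "server.c":
        return Finding(path, "unchecked length in packet parser", 0.5 + 0.2 * depth)
    return None

triage_pipeline(["server.c", "README.md", "util.c"], toy_model)
# Returns one Finding for server.c; README.md is never analyzed, util.c is cleared.
```

The point of the sketch is that the model call is one line; everything else is the system Fort argues is the moat.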
Claude’s Take
This is a well-argued, evidence-backed counter to Anthropic’s marketing narrative, and it lands. Fort isn’t claiming small models are better — he’s claiming the capability distribution is lumpy and unpredictable, which is a subtler and more useful insight. The jagged frontier framing is genuinely clarifying.
The AISLE track record (180+ validated CVEs, praise from OpenSSL’s CTO) gives Fort standing to make these claims. He’s not theorizing — he’s shipping results with the exact approach he’s advocating for.
The sensitivity vs. specificity update is the most practically important part. Finding bugs is table stakes. Not drowning maintainers in false positives while doing it — that’s the actual engineering problem. It’s the difference between a fire alarm that works and one that goes off every time you make toast.
One thing to watch: Fort has obvious commercial incentive here. AISLE competes in the AI cybersecurity space, and “you don’t need frontier models” is also “you don’t need Anthropic’s product.” That doesn’t make him wrong, but it’s worth noting the frame.
Score: 8/10. Concrete evidence, clear thesis, practical implications for how AI cybersecurity actually works vs. how it’s marketed. The kind of piece that ages well.