DeepSeek's New AI Is A Game Changer

ELI5/TLDR

DeepSeek published a paper describing an open vision model that, instead of describing images in long paragraphs, learns to point at things while it thinks. Pointing is cheaper and more accurate than describing, so the model uses about 90% fewer visual tokens than the big paid models and still matches or beats them on benchmarks. The trick is training one student model by copying from a panel of specialist teacher models, each good at one kind of visual task. The blueprint is free and open, which means almost anyone can bolt it onto their own model.

The Full Story

The problem with describing images in words

Ask a typical vision model to count the people in a crowded photo and it will start narrating: “There are some people on the upper left, a few stripy guys in two rows, kind of three rows, some standing, some sitting.” By the time it finishes the sentence, it has lost track of the count. Two costs follow. The narration is error-prone, and it burns a lot of tokens, which means time and money.

Humans don’t do this. We point. One, two, three, done.

Don’t describe images like a poet. Point like a human.

Pointing as a primitive

DeepSeek’s contribution is letting the model use visual primitives — points, boxes, traces — as part of its chain of thought. The model can mark where it is looking while reasoning, the same way you’d jab a finger at a screen.
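To make the idea concrete, here is a minimal sketch of a reasoning trace that mixes text with visual primitives. The class names, fields, and normalized coordinates are all assumptions made for illustration; the source does not say how the model actually encodes points, boxes, or traces.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical primitive types; coordinates are assumed to be
# normalized to the range 0..1 within the image.

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

@dataclass
class Trace:
    points: List[Tuple[float, float]]

# A reasoning step is either plain text or one of the primitives.
Step = Union[str, Point, Box, Trace]

def count_points(steps: List[Step]) -> int:
    """Count by pointing: one Point per object, no narration required."""
    return sum(isinstance(step, Point) for step in steps)

# Illustrative trace: the model points at each person instead of describing them.
reasoning: List[Step] = [
    "Counting people.",
    Point(0.12, 0.30),
    Point(0.45, 0.28),
    Point(0.80, 0.33),
    "Three people.",
]

print(count_points(reasoning))  # 3
```

The count falls out of the marks themselves, which is the point: three short coordinates stand in for a paragraph of description.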

The side effects are nice. Faster answers. Cheaper inference. And, importantly, the reasoning becomes inspectable. Give the system a maze with a start and an end, and you don’t just get the answer — you can trace the model’s path through the maze visually. Ask which object the crown is connected to and it will both say “the octopus” and show you the line it drew to get there.

This is more than a parlour trick. If a model goes wrong, you can see where its finger landed and fix that step. It’s a small step toward AI you can audit rather than AI that hands you a soup of numbers.
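Here is a rough sketch of what that kind of audit could look like for the maze case. The grid format, the function name, and the validity rules are hypothetical; the idea is only that a visible trace can be checked step by step, so the first bad move is easy to locate.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)

def first_bad_step(maze: List[str], trace: List[Cell]) -> Optional[int]:
    """Return the index of the first invalid point in the trace, or None.

    Assumed rules: '#' is a wall and any other character is open. A point is
    valid if it lies inside the grid, sits on an open cell, and is exactly
    one cell up, down, left, or right of the previous point.
    """
    for i, (r, c) in enumerate(trace):
        inside = 0 <= r < len(maze) and 0 <= c < len(maze[r])
        if not inside or maze[r][c] == "#":
            return i
        if i > 0:
            prev_r, prev_c = trace[i - 1]
            if abs(r - prev_r) + abs(c - prev_c) != 1:
                return i
    return None

maze = [
    "S.#",
    "#.#",
    "#.E",
]
good_trace = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
bad_trace = [(0, 0), (0, 1), (0, 2), (1, 2)]  # third point lands on a wall

print(first_bad_step(maze, good_trace))  # None
print(first_bad_step(maze, bad_trace))   # 2
```

With a text-only answer there is nothing comparable to check; with a trace, the failure has an index.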

The numbers, with a sanity check

The model uses about 90% fewer visual tokens than frontier models, and the accuracy holds: averaged over seven benchmarks, this free system matches or beats systems that cost billions to build.
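For a sense of scale, here is the arithmetic behind a 90% cut. The baseline of 2,000 visual tokens per image is a made-up figure for illustration, not a number from the paper.

```python
def tokens_after_reduction(baseline_tokens: int, reduction_percent: int = 90) -> int:
    """Visual tokens left after removing `reduction_percent` of the baseline."""
    return baseline_tokens * (100 - reduction_percent) // 100

baseline = 2000  # hypothetical visual tokens per image for a baseline model
remaining = tokens_after_reduction(baseline)

print(remaining)              # 200
print(baseline // remaining)  # 10
```

Put differently, a 90% reduction means roughly one tenth of the visual tokens for the same image.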

The usual move in this corner of the field is to invent your own benchmark, win it, and call it a day. The DeepSeek paper does the opposite: it deliberately excludes its own in-house benchmarks from the headline number. The comparison is on shared turf.

How they trained it

The training trick has a name: policy distillation. Imagine a roomful of specialist tutors. One is the world’s best at drawing bounding boxes around things. Another is unbeatable at tracing mazes with points. None of them, on their own, is the model you want — you want one student who can do all of it.

So you train a student. The student attempts a task, each relevant teacher shows what it would have done, and over enough rounds the student absorbs the lot. The distilled student ends up being decent at every kind of visual thinking the teachers knew.
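A toy sketch of that loop's scoring step, under loud assumptions: each teacher is reduced to a fixed probability distribution over three candidate actions, and the student is scored with KL divergence against the teacher that owns the task. KL divergence is a common choice in distillation, but the source does not state the paper's actual loss, so treat every name and number here as hypothetical.

```python
import math
from typing import Dict, List

def kl_divergence(teacher: List[float], student: List[float]) -> float:
    """KL(teacher || student) over the same set of options.

    Assumes every student probability is greater than zero.
    """
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def distillation_loss(task: str,
                      student_probs: List[float],
                      teachers: Dict[str, List[float]]) -> float:
    """Compare the student's attempt with the teacher responsible for this task."""
    return kl_divergence(teachers[task], student_probs)

# Hypothetical specialists: one for bounding boxes, one for maze tracing.
teachers = {
    "boxes": [0.8, 0.1, 0.1],
    "maze": [0.1, 0.1, 0.8],
}

untrained_student = [1 / 3, 1 / 3, 1 / 3]
improved_student = [0.7, 0.2, 0.1]

before = distillation_loss("boxes", untrained_student, teachers)
after = distillation_loss("boxes", improved_student, teachers)
print(after < before)  # True
```

Training would push that loss down across every task type, each time against the relevant teacher, which is how one student ends up covering what several specialists know.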

The paper is a blueprint, not a model release. That means other open-weight models can borrow the technique.

Where it still breaks

Three limits, called out honestly:

The model doesn’t reach for the pointing mode on its own. It needs a word in the prompt to trigger that style of thinking.
Bounding boxes are great for counting people, less great for counting blades of grass or strands of hair. Compressing visual tokens means losing the very fine stuff.
The topological reasoning — the maze-tracing kind — doesn’t generalize as cleanly as you’d want. Show it something genuinely new and it might wobble.

Key Takeaways

DeepSeek’s new vision-language paper teaches the model to reason with visual primitives (points, boxes, traces) instead of describing what it sees in words.
Roughly 90% fewer visual tokens than frontier models, with matching or better accuracy on a seven-benchmark average.
The headline average excludes the paper’s own in-house benchmarks, so the comparison is on shared turf rather than rigged.
Training method: policy distillation. A student model learns from a set of expert teachers, each strong at a different visual task.
Bonus: the reasoning is visually inspectable, which makes errors easier to localize and fix.
It’s a paper, not a released model — a blueprint that other open-weight systems can adopt.
Limits: the model needs a prompt cue to activate this mode, struggles with very fine structures (hair, grass), and topological reasoning doesn’t always generalize.
Counterintuitive lesson: more pixels and higher resolution aren’t always the path to smarter vision models. Sometimes less is more.

Claude’s Take

The honest version: Two Minute Papers always sounds breathless, but the underlying idea here is genuinely interesting and not just hype. Vision-language models burning thousands of tokens to describe an image they could have just pointed at has been an obvious inefficiency for a while. Treating “where to look” as a first-class output rather than a side effect of language is the kind of architectural move that tends to age well.

The 90% token reduction number is the kind of figure that gets people excited, but the real story is interpretability. A model that points at its working is a model you can debug. That matters more than the speed-up for anyone who actually has to ship one of these things.

Karoly is also right to flag the limits. Needing a prompt cue is awkward — it means the model hasn’t internalized when to switch modes. And the fine-structure problem (hair, grass, anything where you can’t crisply put a box around it) is the kind of failure case that comes up in the real world more than benchmarks suggest.

Score 8. Solid synthesis of a real result, honest about caveats, and the underlying paper is actually a contribution rather than a re-skin. Half a point off for the breathless framing, but that’s the channel’s voice and it’s earned the right.