YouTube

First impressions of DeepSeek V4 (open source)

Arena AI published 2026-04-24 added 2026-04-24 score 6/10
ai llm deepseek open-source model-comparison benchmarks

ELI5/TLDR

DeepSeek just dropped V4, their first major model since December 2025. The reviewer tested it the old-fashioned way — feeding it the same prompts he gives everyone else (3D scenes, SVGs, retro UI mockups) and eyeballing the outputs. Verdict: massive leap over their own V3.2 from four months ago, but only roughly tied with GLM 5.1 and clearly behind Opus 4.7, Gemini 3.1, and GPT 5.4. A respectable catch-up, not a coronation.

The Full Story

What we’re actually testing

The reviewer runs what’s basically a visual Pepsi Challenge. He throws the same prompt at every recent model — DeepSeek V4 Pro (the thinking version, given the best shot), Opus 4.7, Gemini 3.1, GPT 5.4 high, GLM 5.1, Minimax, Kimi, Meta’s Muse Spark — and judges the one-shot outputs side by side. The prompts are things like “voxel Roman city,” “Golden Gate bridge,” “orbital travel booking console,” “retrofuturist home automation OS.” No benchmarks, no arena ELO, just does-it-look-right.

Why skip benchmarks? Because benchmarks get gamed. Arena scores aggregate over thousands of prompts but hide the texture: what a model is good at and what it botches. Running the same test yourself across ten models in an afternoon tells you more than a leaderboard.
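For anyone who wants to repeat this kind of same-prompt comparison, here is a minimal sketch of a harness. It is not the reviewer's tooling, which the video does not show. It assumes each model can be wrapped as a plain function that takes a prompt and returns HTML or SVG text. The names `Generator`, `stub_model`, and `run_side_by_side` are hypothetical, and the stub stands in for whatever real API call you would use.

```python
import html
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical interface: a "model" is any function mapping a prompt to HTML/SVG text.
Generator = Callable[[str], str]


def stub_model(label: str) -> Generator:
    """Stand-in for a real model call; returns a tiny SVG naming the model and prompt."""
    def generate(prompt: str) -> str:
        text = html.escape(f"{label}: {prompt}")
        return (
            "<svg xmlns='http://www.w3.org/2000/svg' width='300' height='60'>"
            f"<text x='10' y='35'>{text}</text></svg>"
        )
    return generate


def run_side_by_side(
    prompts: List[str], models: Dict[str, Generator], out_dir: Path
) -> Path:
    """Send every prompt to every model once and build one comparison page."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sections = []
    for p_idx, prompt in enumerate(prompts):
        cells = []
        for m_idx, (name, generate) in enumerate(models.items()):
            # One shot per model: no retries and no cherry-picking.
            artifact = out_dir / f"p{p_idx}_m{m_idx}.html"
            artifact.write_text(generate(prompt), encoding="utf-8")
            cells.append(
                f"<figure><figcaption>{html.escape(name)}</figcaption>"
                f'<iframe src="{artifact.name}" width="480" height="360">'
                "</iframe></figure>"
            )
        sections.append(
            f"<h2>{html.escape(prompt)}</h2>"
            f'<div style="display:flex;flex-wrap:wrap">{"".join(cells)}</div>'
        )
    index = out_dir / "index.html"
    index.write_text(
        "<!doctype html><html><body>" + "".join(sections) + "</body></html>",
        encoding="utf-8",
    )
    return index
```

Calling `run_side_by_side(["voxel Roman city"], {"model-a": stub_model("A")}, Path("out"))` writes `out/index.html` with one row per prompt and one frame per model. Judging the results is still done by eye, as in the video.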

The leap from V3.2 to V4

This is the unambiguous win. DeepSeek’s last major release was V3.2 in December 2025. Between then and now, other labs kept shipping: Anthropic, Google, OpenAI, Meta, plus the Chinese pack (ZAI’s GLM, Alibaba’s Qwen, Kimi, Minimax). DeepSeek went quiet and people started wondering if they’d been lapped.

V4 answers that. The voxel Roman city comparison between V3.2 and V4 is, in his words, “pretty crazy” — not a marginal bump, a different tier of output. The SVG that V3.2 couldn’t even render into something recognizable, V4 handles cleanly. So whatever they’ve done — new pre-training, new post-training, probably both — it closed the gap.

Where V4 actually lands

Roughly on par with GLM 5.1. That’s the reviewer’s honest summary. GLM 5.1 was, before this release, the strongest open-source model. DeepSeek V4 matches it, sometimes edges ahead on UI creativity, sometimes falls behind on structural coherence.

The gap to the frontier closed-source models is still real. On the Cappadocia balloon scene, Opus wins on overall vibe. On the voxel city, Gemini wins on richness. On SVGs with motion, Gemini is still best (partly because Google benchmarks on SVG, he notes). On the retrofuturist home OS, Gemini and Opus both deliver a more tactile, believable feel than V4.

One recurring weakness — consistency. Several DeepSeek generations had pieces that didn’t compose. A pyramid in golden hour where the lighting didn’t hang together. A Golden Gate bridge with strange sizing artifacts. The reviewer flags this as a pattern across second-tier models generally: they can produce the parts, but stitching them into a coherent whole is where the frontier models still pull ahead.

A small argument for diversity

One genuinely interesting observation. On the fish generation, V4’s output was different from Opus’s — not better, just different. The reviewer wonders if open-source models developing along their own axes might produce more visual diversity in the ecosystem. Everyone training on the same data with the same techniques tends toward the same aesthetic. If DeepSeek’s weirdness is the side effect of a genuinely distinct training recipe, that’s a feature, not a bug.

The release cadence question

Here’s the strategic concern. DeepSeek’s V3 release in December 2024 set up the “wait, you can do this cheaper?” moment, and R1 in January 2025 amplified it enough that Nvidia fell roughly 17% in a single day. Since then, mostly quiet.

If V4 is only on par with GLM, and DeepSeek’s next release is another 6-8 months away, they fall behind. Qwen is shipping so fast the reviewer says he can’t keep track. The open-source Chinese ecosystem has moved to a fast-iteration cadence. DeepSeek’s old mode (drop a bomb, go silent for a year) might not work anymore.

Key Takeaways

  • V4 is a big internal jump, a modest external one. Enormous improvement over DeepSeek V3.2. Roughly tied with GLM 5.1. Still behind Opus 4.7, Gemini 3.1, GPT 5.4 high.
  • Consistency is the weak spot. V4 can produce the parts but sometimes fails to compose them into a coherent whole. Frontier models still hold this line.
  • Test models yourself. Benchmarks and arena scores compress too much. Ten prompts across ten models in an afternoon tells you more than any leaderboard.
  • Cadence matters now. DeepSeek’s old once-a-year drop may not cut it in an ecosystem where Qwen ships faster than the reviewer can track.
  • The Pro thinking variant was used — i.e., V4 got its best shot. Regular V4 would presumably rank lower.

Claude’s Take

This is a useful five-minute sanity check on DeepSeek V4, delivered without drama. The reviewer doesn’t oversell or undersell — V4 is clearly a real release, clearly not a blow-the-doors-off release. “Par with GLM” is a specific, testable claim, not hedged marketing talk.

What the video lacks is rigor. Side-by-side eyeballing of HTML/SVG generations is fine as a first pass, but it’s not a benchmark. No coding tasks, no long-context tests, no agentic behavior, no reasoning problems. If you only care whether V4 can make a pretty voxel scene, you have your answer. If you care whether it can write a React component that compiles, you’re still in the dark. Worth noting that DeepSeek’s historical strengths (math, code, reasoning) are exactly what didn’t get tested here.

The diversity point is the most interesting bit buried in an otherwise workmanlike review. We’re heading toward a world where five or six top models all trained on substantially overlapping data start producing outputs that feel like variants of each other. An open-source model with weird training choices might be valuable precisely because it’s weird. Whether that’s DeepSeek V4 or just wishful thinking from one aesthetic judgment call, time will tell.

6/10 — a reasonable first-look video, but you’d want to pair it with a technical deep dive before forming a real opinion.

Further Reading

  • DeepSeek R1 (January 2025) — the release that caused the Nvidia drop, worth revisiting for context on what the V-series was reacting to
  • LMArena (arena.lmsys.org) — where the reviewer runs these side-by-side tests; good place to poke models yourself before trusting anyone’s take