
Transcript

First Impressions of DeepSeek V4


TITLE: First impressions of DeepSeek V4 (open source)
CHANNEL: Arena AI
DATE: 2026-04-24
URL: https://youtu.be/AC2jj_jfunQ
---TRANSCRIPT---
This is one of the most important releases of this year, and I’ve had a whale of a time testing this model. We had a long time to wait for this release. If I look at their Wikipedia page to remind myself how long that was: they had the groundbreaking DeepSeek V3 in December 2024, then the release that really took the world by storm was R1 in January 2025, and then the latest big release was in December 2025. So even the latest release (3.2) was a number of months ago, and we have had multiple models from many different labs since. If you think back to what kind of models were around in late 2024 and early 2025, ZAI with the GLM models wasn’t particularly prominent in the field. Kimi 2.5 was just about being released. The competition was much less strong in the Chinese ecosystem. Maybe the Qwen models were quite strong, but beyond that not so much. So it will be really interesting to explore whether DeepSeek really kept up with everyone else while they were holding back some of their releases. Did they exceed everyone? Are they going to be better than Opus?

What’s really important for me is that I don’t just look at benchmarks. I don’t even just look at the arena score; I really test the models myself. Quite often I go on the arena with these prompts, some short, some long, and I get these one-off generations. I can look at the code, quite often just HTML pages, and then I can see the generation. The reason I do that, rather than going through long code bases, is that this way we can run a lot of tests, compare many different models or many different prompts, and get a good feel for how these generations come out.

The model we’re going to be looking at the most today is the DeepSeek V4 Pro thinking model, just to give it the best chance against others. So the first generation, this looks like quite a nice generation. We want to see the balloons here and this is Cappadocia in Turkey, these natural stones. But let’s see how it compares to others. We have GPT 5.4 high here which looks more pleasant to me. The sun is quite dramatic, the shapes are good, certainly a good generation. This one is GLM 5.1 — GLM is the one to look out for in terms of competition because GLM 5.1, at least before the DeepSeek release, was the highest ranking model. I actually don’t know where DeepSeek is going to rank. Let’s see if we can gauge that together. This is Opus 4.7. Opus for me wins here, just the whole vibe and quality that feels really spot-on. This is Muse Spark by Meta — you can see this is the weakest generation unfortunately. And we’ve got a couple more GLM, not quite as strong there.

The next one is a voxel generation of a Roman city. This is DeepSeek V4, pretty good. Then I want you to see DeepSeek 3.2. When you see these two side by side in terms of quality, it’s pretty crazy. We can argue about whether it is quite as good as Opus, but certainly versus itself a mere four, maybe five months ago, this is quite a dramatic difference. Now, does it match up to some of the others? Against Gemini 3.1, it doesn’t. Gemini is particularly good at these kinds of generations. GPT 5.4 high is really quite a nice generation. Opus 4.7 as well, really good. So it is very exciting to see these kinds of models progressing so much, but we are not at a point where DeepSeek gets up to that high-quality generation level.

My favorite, the Golden Gate bridge generation: this one did not do well. The traffic is kind of crazy, the bay doesn’t really work, not the best generation. Some of the sizing is going wrong somewhere. That’s quite weird, I haven’t seen quite that kind of generation. Although the slight silver lining is the fact that it’s weirded out in a very strange way. Maybe that has some advantages, especially being open source, in that we get a bit more diversity. Even so, this was 3.2, which was basically just kind of laughably bad as well, but it is a hard problem. If you look at Opus 4.7, there is clearly something a little bit wrong with the bridge. With Gemini 3.1 the bridge is a lot more coherent, but it’s not very interesting. So it certainly is a hard prompt and I wouldn’t say DeepSeek did terribly here, but it’s not quite as good as some of the others.

This one I did like — you see what I mean about diversity. If you look at the fish, it’s not the most interesting, not the highest quality fish. But if you look at Opus it is different from Opus. I wouldn’t say it’s better but it is different. So I wonder whether this kind of development in a different direction could give us a little bit more diversity, which could be interesting. Minimax M2.5 — not quite the latest, but I did like this generation. I would say DeepSeek is better than that. But Gemini, very rich world, very interesting. That wins that round.

SVGs — there were interesting ones and I’d say DeepSeek seems a lot better than it used to be, certainly better than some open source models. Minimax M2.7 a bit worse. GLM 5.1 better to my eye or similar. The GLM seems really good in that category. Muse Spark by Meta worse. Gemini — they’re slightly benchmarked on SVG, so I don’t want to say oh it’s the best model because of it. It is kind of a loose test at the end of the day. I like the cycling one because it gives this movement, it requires a structure. DeepSeek V3.2 for the previous model — this doesn’t even compute what this is meant to be. GLM again a bit better, Muse not as good. So far it feels like GLM is a little bit better, then DeepSeek is the next one, then Minimax and Muse around that area.

This one I really like: the voxel pyramids. We want the complexity, the structure, the beauty. That is a nice generation. But Kimi K2.5 thinking, not even quite the latest model, also had some generation, though the DeepSeek one was better. This DeepSeek generation is a little bit off for me, a bit jarring, and I know it’s golden hour, but things are not quite hanging together. One thing I noticed, and it applies to many slightly second-tier models, is that the ability to maintain consistency is not quite there. There are a few generations I just haven’t selected because they’d be a little bit weird to look at when they don’t compile or come together. Claude Opus 4.7 feels much higher quality, although it has some spatial issues. Some room for improvement, but again, if you compare versus DeepSeek 3.2 four, five months ago, the jump is crazy. Muse Spark not so great.

I want to show you this one because I mentioned I didn’t keep all the badly generated ones. This is a good example of just a bad generation. You can kind of see the columns there but it didn’t come together although the other things are nice. Versus Gemini 3 Pro, certainly this is the kind of structure you’re expecting. Sonnet — some of the colors you can quibble with, but certainly much better generations out there. GLM 5.1 came together much better. Stability is certainly something to look out for.

SVGs are quite a nice test because they require movement and coordination. I’d say here, although Gemini is benchmarked on this, and probably everyone is right now except Muse, the others seem to be catching up. DeepSeek V3.2 is nowhere near where it should be in terms of quality.

Let’s look at the Acropolis. It really tests nicely on structure. One thing annoying — I can actually zoom out which I do not like. You can see certain structural issues there, and the fact it didn’t give me free control is not a good generation. Certainly quite a lot of issues. GPT 5.4 high much nicer construction, maybe a little bit boring but quite nice. GLM 5.1 closer to DeepSeek. Opus a little bit odd to be honest, obviously weird elements, but overall construction you’ll have to agree it is nicer there. Gemini a little bit boring but a decent generation.

We looked at SVGs, 3D generations. I also want to show some prompts which are a bit out of distribution almost — UI generations. This one is orbital travel booking console. GLM 5.1 interesting take to have this kind of nice element up front, I like that. Muse Spark not too bad, a little tight and confused, things not quite aligned. Opus 4.7 feels much more organized straight away, I like this. Gemini 3.1 — interesting, I must say I do not understand the creative intent.

1907 world’s fair exhibition site — trying to test creativity, not just a to-do app. This feels quite boring, an okay website, it’s fine. GLM kind of feels worse — heavy lines, structure a bit heavy, foreground map not as good. DeepSeek is better here. Muse Spark I like this ability to add, all feel a bit too modern, too polished. I was expecting a little bit more like that. Opus 4.7 got the intent a bit more — font selection is nice, feels old, this is nice. Opus is the top, DeepSeek up there, Gemini a bit boring, then GLM and then Muse.
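A verbal ranking like this can be kept as a simple head-to-head tally. The following is a minimal sketch, not anything shown in the video or used by the arena: it takes the two rounds where an explicit order is spoken (the SVG round and this world's fair round) and counts how often one model was placed above another. The function name `head_to_head`, the shortened model labels, and the data layout are illustrative assumptions, and placing Minimax above Muse in the SVG round is also an assumption, since the transcript only puts them "around that area".

```python
from collections import defaultdict
from itertools import combinations

# Orderings as spoken in the transcript, best first (labels shortened).
# Assumption: Minimax is listed above Muse in the SVG round, although the
# speaker describes them as roughly level.
rounds = {
    "svg": ["GLM", "DeepSeek", "Minimax", "Muse"],
    "worlds_fair_1907": ["Opus", "DeepSeek", "Gemini", "GLM", "Muse"],
}


def head_to_head(rounds):
    """Count, for each ordered pair (a, b), the rounds where a ranked above b."""
    wins = defaultdict(int)
    for ordering in rounds.values():
        # combinations() preserves list order, so `better` precedes `worse`.
        for better, worse in combinations(ordering, 2):
            wins[(better, worse)] += 1
    return wins


wins = head_to_head(rounds)
print(wins[("DeepSeek", "GLM")], wins[("GLM", "DeepSeek")])    # 1 1
print(wins[("DeepSeek", "Muse")], wins[("Muse", "DeepSeek")])  # 2 0
```

With only two rounds this says very little, and models that never appear in the same round are simply not compared. It is a bookkeeping aid for your own tests rather than a score.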

Revenue recovery command center. A bunch of charts — feels a little superficial, things not aligned, bright colors. Muse feels more like a dashboard of someone managing revenue. GLM not quite as nice. Gemini not bad, quite functional, a little bare bones. Opus 4.6 kind of bad, looks a bit like DeepSeek. Don’t love these generations. Muse Spark did best here.

Deep Sea research dashboard. Font very tiny, things not quite together, don’t completely love that. GLM 5.1: the ambition was there, but kind of bare. Opus: I don’t know why everyone has big gaps. Maybe it’s out of distribution, but it feels nicer, aligned to this research center vibe, you can almost imagine yourself in a submarine. Gemini: I don’t like this, kind of empty, things a bit weird. I would have expected to like Gemini more than I do right now.

Retrofuturist home automation OS. We want these retro elements — the little radio, all the different things. Certainly went the more creative way. Opus feels nicer, almost tactile, that is a lot better. Gemini’s idea — also feels tactile, even more tactile, that is more creative. Before between the first and second they were more similar, here maybe Gemini. I like this, deserves some props.

Vertical farm. Feels a bit like a game environment which I don’t mind. GLM similar vibe, maybe a little nicer, the toggles a little better. Functionally feels nicer. It could really be between GLM and DeepSeek, they’re quite close.

Is this a good thing or bad thing? Because DeepSeek was really ahead among the labs in China and open source labs — Nvidia share price dropped 20%, something like that. The fact that it is right now sort of similar to existing open source models — it would be interesting to see first of all where the arena score lands, and how they’re going to iterate. If they’re going to release another model in 6 months, 8 months, that wouldn’t feel really in line with a lot of other open source Chinese labs. Qwen is releasing so many models I can’t even follow. If they get into the cadence of releasing much quicker, maybe they’re going to keep up and even outpace the others.

That’s what’s interesting to me, because right now I don’t think any generations I’ve looked at today would make me say DeepSeek is way better than Opus or way better than any current model. It is maybe on par with GLM. That is my feel. Certainly not quite as stable as some of the leading models. Not quite as creative. But it’s certainly a massive leap versus DeepSeek V3. Massive leap. Clearly must be a new pre-train, new post-train, new many things. But in terms of matching quite to the current leaders — broadly caught up, versus being really quite far behind.

I’m really looking forward to seeing what they’re going to do next. Anytime a model comes out I really recommend: do try these models. Don’t just look at benchmarks. Don’t just look at scores. Don’t even just look at what I’m doing. Try these things yourself and you’ll find some things you disagree with me on. Hopefully this gives you a starter, a feel for where these models are. Go on the arena. It’s a nice place to try these models.