Deepseek V4, GPT-5.5, Kimi K2.6, MiMo Pro, video game agents, 4K editing: AI NEWS

ELI5/TLDR

A wild week in AI. Deepseek finally shipped V4 to mixed reviews, OpenAI’s GPT-5.5 reclaimed the top closed-model slot, and two new open-source models, Kimi K2.6 and Xiaomi’s MiMo 2.5 Pro, tied for the open-source crown. On the side: an open-source agent that builds video games end to end, a HuggingFace agent that does ML research on its own, a humanoid robot that ran a half-marathon in about 50 minutes, and Google’s Vision Banana, which segments images with disturbing precision. Most of it is genuinely open-sourced.

The Full Story

The closed-model headline: GPT-5.5

OpenAI shipped GPT-5.5, which the host claims is now noticeably more capable than Claude Opus and the rest of the closed pack. No benchmarks are shown here, just vibes and a pointer to a separate video where he says he covered it. Take that for what it is.

The open-source three-way: Kimi K2.6, MiMo 2.5 Pro, Deepseek V4

This is the actual story. Three large open-source models dropped in the same week.

Kimi K2.6 is the new headline open-source model. It has 1.1 trillion parameters, needs ~600 GB to host, and is on par with GPT-5.4 High and Opus 4.6 on agentic, coding, and visual benchmarks. The bragging-rights demo: it autonomously downloaded Qwen 3.5 onto a Mac, rewrote the inference code in Zig (a niche systems language), and after 12 hours and 4,000 tool calls pushed throughput from 15 tokens/sec to 193 tokens/sec. It also handles agent swarms of up to 300 sub-agents across 4,000 coordinated steps, which is the kind of capability that turns prompts like “scrape 100 jobs and write 100 tailored resumes” into a real workflow.
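The two headline numbers here reduce to simple arithmetic. A minimal sketch using only the figures quoted above; the function names are illustrative, and "GB" is assumed to mean 10^9 bytes:

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio between two tokens-per-second measurements."""
    return after_tps / before_tps


def bits_per_parameter(size_gb: float, params: float) -> float:
    """Average storage per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


# Figures as quoted in the summary: 15 -> 193 tokens/sec, 1.1T parameters, ~600 GB.
print(f"throughput speedup: {speedup(15, 193):.1f}x")
print(f"storage per parameter: {bits_per_parameter(600, 1.1e12):.2f} bits")
```

That works out to roughly a 13x speedup and a little over 4 bits per parameter. The second figure would be consistent with weights stored at around 4-bit precision, but the summary does not state the storage format, so treat that as an inference.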

Xiaomi MiMo 2.5 Pro ties Kimi at the top of the Artificial Analysis Intelligence Index. The flex: it built a full desktop video editor (multitrack, crossfades, audio mixing, export) in 8,000 lines of code over 11.5 hours of autonomous work. MiMo also wins on efficiency, using fewer tokens per trajectory than its peers. The open-source release is “soon”; for now you can use it via their chat interface or API. There’s also a multimodal version (MiMo 2.5, no Pro) that handles images and audio.

Deepseek V4 is the disappointment of the week, but only relatively. It came in two sizes: a Pro at 1.6T parameters (49B active) and a Flash at 284B, both with 1M-token context windows. On benchmarks it lands roughly where Opus 4.6 and GPT-5.4 do, but among open-source models it ranks third behind Kimi K2.6 and (on the LMArena blind-test board) GLM 5.1. It does win the “vibe code bench,” whatever that is. The hype was disproportionate to the result, but the model is fully open-sourced and still cheaper than any closed alternative.

Tencent Hi3 preview also dropped: a 295B-parameter mixture-of-experts model with only 21B active. It is roughly four to five times smaller than the trillion-parameter set, but reportedly matches them on reasoning and agentic benchmarks. Open-sourced.
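For the mixture-of-experts models, the "active" figure is the part that does the work for any given token. A minimal sketch of the ratios implied by the parameter counts quoted above; the dictionary and function names are illustrative:

```python
def active_fraction(active_params: float, total_params: float) -> float:
    """Share of a mixture-of-experts model's parameters used per token."""
    return active_params / total_params


# (active, total) parameter counts as quoted in the summary.
MOE_MODELS = {
    "Deepseek V4 Pro": (49e9, 1.6e12),
    "Tencent Hi3 preview": (21e9, 295e9),
}

for name, (active, total) in MOE_MODELS.items():
    print(f"{name}: {active_fraction(active, total):.1%} of parameters active")

# Size of the larger models relative to Hi3's 295B total.
for name, total in {"Kimi K2.6": 1.1e12, "Deepseek V4 Pro": 1.6e12}.items():
    print(f"{name} is {total / 295e9:.1f}x the size of Hi3 preview")
```

On these numbers, Deepseek V4 Pro activates about 3% of its weights per token and Hi3 about 7%. Total size still sets the hosting bill, while the active count is the better rough guide to per-token compute.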

Alibaba Qwen 3.6 27B is the mid-size winner: a dense 27B model (all parameters active, no MoE shortcuts) that takes 55.6 GB on disk, fits on a single high-end GPU, and beats the much larger open Gemma 4. Natively multimodal.
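The 55.6 GB figure is easy to sanity-check against the parameter count. A rough sketch, assuming 1 GB = 10^9 bytes and counting weights only (no KV cache or activations); the precisions listed are illustrative, not release formats confirmed by the source:

```python
def weights_gb(params: float, bits_per_param: int) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes) at a given precision."""
    return params * bits_per_param / 8 / 1e9


PARAMS = 27e9          # dense parameter count quoted above
QUOTED_DISK_GB = 55.6  # on-disk size quoted above

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weights_gb(PARAMS, bits):.1f} GB")

print(f"implied bytes per parameter: {QUOTED_DISK_GB * 1e9 / PARAMS:.2f}")
```

The quoted size implies about 2 bytes per parameter, which lines up with 16-bit weights. Whether that "fits on a single high-end GPU" then depends on the card's memory and how much headroom the context needs.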

The agent-builds-stuff category

Open Game is an open-source agent that builds full video games end to end, from classification through asset synthesis to verification, with a self-correcting debug loop and a growing library of reusable templates. The demos play. The results are presented as better than zero-shot prompting a frontier LLM.

Open Code Design is a self-hosted alternative to Lovable or Figma AI: plug in any model and it builds UIs, slide decks, posters, and PDFs from a text prompt or reference image. Single .exe install.

HuggingFace ML Intern is the most interesting one. An autonomous agent that reads research papers, finds relevant datasets, fine-tunes models, and writes ML code, all glued together by the HuggingFace ecosystem. Demo: told to “train the best model for scientific reasoning,” it dug up a benchmark, fine-tuned Qwen 3, and lifted GPQA scores from 10% to 39% in 10 hours. A graduate student in a box.

The image and video tools

Vision Banana (Google) is a unified image-understanding model that produces segmentation maps, depth maps, and surface-normal estimates from any image. The segmentation demos are absurdly precise: ask it to “color each piece of garlic differently” and it does. State of the art on 3D understanding too. Technical report only, no open-source release announced.

GPT Image 2 (OpenAI) is the new image-generation leader. The headline demo: a 100-poster grid of anime shows where every poster reads cleanly, plus a faked Windows 11 desktop screenshot with a working Slack chat and a coherent Excel spreadsheet. Beats Nano Banana on text rendering and detail.

UniGen is an interesting design: one model that both generates and detects fake images. Training one improves the other: generation gets more realistic and detection gets sharper. Open-sourced.

Edit Crafter edits images at up to 4K resolution and requires 24 GB of VRAM. Tends to oversaturate.

Uni Geo does precise camera-controlled image edits, such as “pan left 16 degrees, tilt up 7 degrees,” by reconstructing the input as a 3D point cloud and then re-rendering it. Coming soon.

Multiworld generates video-game worlds with multiple agents and multiple camera angles in sync, like a director who knows where every actor and camera sits in 3D space. Useful for robot training data. Open-sourced, including datasets.

LTX HDR LoRA is a 340 MB add-on to LTX’s open-source video generator that converts 8-bit SDR output to HDR. Mostly useful for color grading downstream.

Co-inact generates UGC-style influencer product videos from one product photo and one person photo. Step-by-step prompts get rendered sequentially. Bad news for the influencer economy. Open-source release within a week.

Uni Mesh generates and edits 3D models from text or images, plus runs in reverse to caption existing 3D objects. Model release in late May.

Robotics

The second Beijing humanoid robot half-marathon happened. Last year UniTree’s H1 dominated; this year a Huawei spin-off called Honor took the podium. Their robot, “Lightning,” finished 21+ kilometers in 50 minutes 26 seconds, which on paper beats the human world record by nearly 7 minutes (the comparison is suspect: the robot was likely on a different course or rolling, but the host doesn’t dig in). More than 100 robots competed, 5x last year’s field, with 40% running fully autonomously, up from a handful last year.
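A quick way to sanity-check any finishing time quoted for a race like this is to convert it to average speed. A minimal sketch; the sample times in the loop are illustrative round numbers, not results from the race:

```python
HALF_MARATHON_KM = 21.0975  # standard half-marathon distance


def average_speed_kmh(distance_km: float, minutes: float, seconds: float = 0) -> float:
    """Average speed in km/h for a given distance and finishing time."""
    hours = (minutes * 60 + seconds) / 3600
    return distance_km / hours


# Illustrative finishing times, in minutes.
for minutes in (15, 30, 50, 57):
    speed = average_speed_kmh(HALF_MARATHON_KM, minutes)
    print(f"{minutes} min -> {speed:.1f} km/h")
```

For scale, elite human half-marathon times of a little under an hour work out to roughly 22 km/h, so any claimed time that implies several multiples of that deserves a second look at the course and the hardware.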

UniTree also released an acrobatic balance demo: their bipedal robot doing tricks on a single wheel per foot, then on rollerblades, then on ice skates. The point: bipedal robots are inherently top-heavy and unstable, so adding wheels or blades multiplies the control complexity. Thousands of micro-adjustments per second, all real-time.

Key Takeaways

The open-source frontier has caught up to the closed frontier on benchmarks, with Kimi K2.6 and MiMo 2.5 Pro both at parity with GPT-5.4 High and Opus 4.6.
Three serious 1T+ parameter open-source models in one week (Kimi, MiMo, Deepseek V4 Pro), with the caveat that MiMo’s open release is still listed as “soon.” The bottleneck is now hosting, not access.
Agentic capability (long-running, multi-tool, autonomous) is the new frontier. 12-hour autonomous runs with thousands of tool calls are now baseline demos.
Image models have bifurcated: GPT Image 2 for raw generation quality, Vision Banana for understanding (segmentation, depth, normals).
Humanoid robotics is moving fast enough that “fully autonomous” went from rare to 40% in one year of marathon entries.

Claude’s Take

Standard weekly AI Search dispatch, evenly paced and exhaustive in the sense that it lists everything without deciding what matters. The reporting is fine but flat: every model gets the same “pretty awesome” treatment whether it’s a frontier release or a research-paper preview. Nothing is contextualized against what came before, and benchmarks are quoted without skepticism. The Beijing marathon claim, that a robot beat the human half-marathon world record by 7 minutes, is presented as obvious progress when it almost certainly was not a like-for-like comparison (a different course profile, at minimum). A score of 6 reflects what this is: a useful weekly inventory if you want to know what dropped, low on synthesis or judgment.

The real signal buried under the inventory: the open-source vs closed gap is now a few benchmark points and a hosting bill, not a generational gap. And the agentic-runtime numbers (300 sub-agents, 4,000 coordinated steps, 12-hour autonomous sessions) are the specs that will matter in 12 months, not the benchmark deltas.