Amit Jain from Luma AI on Unified Intelligence Systems

ELI5/TLDR

Luma is building AI that handles text, images, video, and audio inside a single brain, instead of stitching separate specialist models together with a thin bridge. Amit Jain, the founder, argues that current image and video models are gorgeous but dumb: they have no memory, no physics, no sense of why anything matters. His fix is “unified models” where one transformer reasons across every medium at once, much as the human cortex processes sight, sound, and language in one place. The bet: whoever cracks this owns the next generation of creative tools, and within a few years unified models will outperform pure language models because they can learn from a much bigger pool of data about the world.

The Full Story

From lidar on the iPhone to a world simulator

Amit came out of Apple, where he worked on the Jasper lidar sensor — the one now in your iPhone — originally built for the Titan car project and later for the Vision Pro. Around 2020, inside Apple, his team started exploring generative models. This was before DALL-E, before anyone was confident language models would scale the way they have. But two signals had landed: language scaling looked real, and a Berkeley paper called NeRF had shown that 3D could be made differentiable.

That word matters. Think of “differentiable” as “teachable by gradient descent.” If you can tweak a knob and measure whether the output got better or worse, you can train a neural network on it. If you can’t, deep learning is off the table. The whole modern AI stack rests on two things, Amit says: compute and gradient descent. Everything else is icing on that cake.
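The "tweak a knob and measure" idea can be sketched in a few lines. This is a toy illustration, not anything Luma described: the loss function, its target value of 3.0, the learning rate, and the step count are all assumptions chosen for the example.

```python
def loss(knob: float) -> float:
    """Toy objective (assumed for illustration): squared distance from 3.0."""
    return (knob - 3.0) ** 2


def numerical_gradient(f, x: float, eps: float = 1e-6) -> float:
    """Estimate the slope by nudging the knob slightly in both directions."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def gradient_descent(f, start: float, lr: float = 0.1, steps: int = 100) -> float:
    """Repeatedly move the knob against the slope of the loss."""
    x = start
    for _ in range(steps):
        x -= lr * numerical_gradient(f, x)
    return x


best_knob = gradient_descent(loss, start=0.0)
```

If the loss could not be measured after each nudge, numerical_gradient would have nothing to return and the loop could not make progress, which is the sense in which a non-differentiable problem is "off the table."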

Put it together: if language was becoming learnable and 3D was becoming learnable, you could eventually feed the whole universe of observations into one loss function. That’s the founding idea of Luma — a “world simulator” that doesn’t just predict the next word but the next frame, the next sound, the next physical consequence.

The painful lesson about data

Luma’s first product was an app called Luma 3D Capture — people walked around objects with their phones, and the app reconstructed them as 3D. It worked. It was the first productionized NeRF and Gaussian splat pipeline. Users loved it. And it still wasn’t enough.

“It doesn’t matter how many people use the app, it will never reach the scale that was necessary to learn enough about the universe.”

The internet has decades of text, photos, and video. No single app can catch up. So Luma flipped its thinking: don’t design the algorithm and then hunt for data. Design the algorithm around where the data already is.

That’s the current bottleneck for robotics, too — there’s no “internet of action data.” You can build labs in China, India, Vietnam, and grind, but you won’t match what already exists in YouTube and Instagram. So Luma pivoted. In 2023, when Nvidia’s Hopper GPUs arrived, they started training on video — because video is two dimensions of space plus one of time, and the human brain uses that time axis to learn 3D anyway.

Dream Machine and the shape of a frontier lab

March 2024: Luma launched Dream Machine, their first video model. Six million users in the first four weeks, because Sora had been announced but not released, and people were starving to try generative video.

What Luma learned from those users is the less sexy, more important half of running a frontier lab. Pre-training gives you a wild, raw distribution — everything the model could possibly do. But what humans actually find useful is a narrow, unpredictable slice of that distribution. Finding and amplifying that slice is the real job.

Their early signal was simple — which videos did people like and download? Crude, and often wrong (some people download bad AI videos specifically to mock them, which the model happily learned from). So they layered in paid human annotators. Then the picture of a frontier lab sharpened:
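One way to picture layering annotators over crude engagement signals is sketched below. The talk does not say how Luma actually combines them, so the scoring rule, the 0-to-1 annotator scale, the threshold, and every name here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clip:
    clip_id: str
    liked: bool
    downloaded: bool
    annotator_score: Optional[float] = None  # hypothetical 0..1 human rating


def preference_score(clip: Clip) -> float:
    """Trust the human rating when present; otherwise use implicit signals."""
    if clip.annotator_score is not None:
        return clip.annotator_score
    return 0.5 * clip.liked + 0.5 * clip.downloaded


def select_for_tuning(clips: List[Clip], threshold: float = 0.75) -> List[str]:
    """Return ids of clips whose score clears the (assumed) threshold."""
    return [c.clip_id for c in clips if preference_score(c) >= threshold]


clips = [
    Clip("a", liked=True, downloaded=True),
    Clip("b", liked=False, downloaded=True),
    Clip("c", liked=True, downloaded=True, annotator_score=0.1),  # popular but mocked
    Clip("d", liked=False, downloaded=False, annotator_score=0.9),
]
selected = select_for_tuning(clips)
```

Clip "c" is the mocked-video case: engagement alone would have selected it, and the annotator rating is what keeps it out.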

“A frontier lab has these components of data, these components of compute, and algorithm, but it also has huge parts of what we call skills and trainers and tutors and people who are doing the labeling of data. If you don’t have that, that is actually not complete.”

And the product itself becomes part of the training loop. Every click, every tweak, every dislike feeds back into the next model.

Why video alone wasn’t enough

By early 2025, another ceiling appeared. Video models could make beautiful clips, but they didn’t understand why an event mattered, or what sequence of events led somewhere, or how a story hangs together. Bolting a language model onto a video model as an embedding lookup doesn’t fix this. You need one brain.

To make the point concrete, Amit gestures at the Luma-built slides behind him, generated by their unified model “Uni 1” from a single prompt plus a screenshot of the host’s diagram. The model read the original layout, understood the style, wrote its own content, and rendered the slides. An LLM can’t do this — it’s blind. A vision-language model (VLM) can see images but can’t generate them. Image generators like Flux can generate but don’t understand. The whole industry has been living in this split-brain world.

The Nano Banana trick, and why it falls short

Google’s Nano Banana — the model most comparable to what Luma builds — uses what Amit calls a “few-shot” architecture. A big language tower generates what’s called an Enhanced Prompt, a detailed text description. Then a separate diffusion tower reads that text through a thin 700-to-800-million-parameter encoder and paints an image.

Imagine two people in separate rooms connected by a fax machine. One writes a paragraph describing what to draw. The other draws what they read. No matter how good either one is, the fax line is a bottleneck. That’s the current state of multimodal models.

The unified architecture

Luma’s bet is to throw out the fax. One transformer, one backbone, all modalities encoded into the same representation space — audio, image, text, video — and all reasoning happens in one place.
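A rough sketch of "same representation space" follows. It is a toy under loud assumptions: the encoders are stand-ins, the shared width of 4 is arbitrary, and the backbone is simple mean pooling rather than a real transformer. The only point it makes is that once every modality is encoded to the same token shape, a single function can process the combined sequence without knowing where each token came from.

```python
from typing import List

WIDTH = 4  # hypothetical shared embedding width
Token = List[float]


def encode_text(text: str) -> List[Token]:
    """Stand-in text encoder: one fixed-width vector per word."""
    return [[float(len(word))] * WIDTH for word in text.split()]


def encode_image(pixels: List[List[float]]) -> List[Token]:
    """Stand-in image encoder: one fixed-width vector per pixel row."""
    return [[sum(row) / len(row)] * WIDTH for row in pixels]


def backbone(sequence: List[Token]) -> Token:
    """Stand-in for the shared backbone: mean-pools the whole sequence.
    It never checks which modality produced a token."""
    n = len(sequence)
    return [sum(tok[i] for tok in sequence) / n for i in range(WIDTH)]


# Text and image tokens are concatenated into one sequence for one backbone.
sequence = encode_text("a red cube") + encode_image([[0.0, 1.0], [1.0, 1.0]])
summary = backbone(sequence)
```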

“Just like the human brain. While it has different areas for processing visual information, auditory information — those are just encoders. All of that information then ends up in your cortex, and reasoning and thinking and judgment happens in one single place.”

The trick, Amit says, is that transformers themselves are remarkably agnostic about what flows through them. Continuous, discrete — they handle it. The pain is always in the encoders and decoders at the edges, which is where the last year’s research grind has been. After “a huge number of failed attempts” they now have architectures they’re confident can scale into the hundreds of billions of parameters.

The factory and the REPL

Once you have the unified model, you wrap it in what’s essentially a modernized REPL, the read-eval-print loop that’s been at the heart of computing since the von Neumann days. Three layers:

Skills on top. Domain knowledge given as context, not baked into the model: a 50-page in-house document on how to design good slides, for example, or, for a customer in the energy industry, ingested grid diagrams and code. Amit claims that after ingesting one customer’s energy grid documentation, Luma’s system outperforms Anthropic’s coding models on grid schematics, because it can actually see the diagrams.
The unified model in the middle. Orchestrating, reasoning, generating, deciding which skill to pull.
Tool harness at the bottom. Ability to call APIs, run Linux commands, execute code — whatever real work needs to happen.
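The three layers can be sketched as one turn of a read-eval-print loop. Everything named here is hypothetical: the skill texts, the two tools, and the stub that stands in for the unified model are invented for illustration and do not reflect Luma's actual interfaces.

```python
from typing import Callable, Dict, List, Tuple

# Skills: domain knowledge supplied as context (hypothetical contents).
SKILLS: Dict[str, str] = {
    "slides": "Use one idea per slide and large type.",
    "grid": "Label every bus and breaker in the schematic.",
}

# Tool harness: plain callables the model may invoke (hypothetical tools).
TOOLS: Dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
    "upper": lambda text: text.upper(),
}


def stub_model(request: str, context: str) -> Tuple[str, str]:
    """Stand-in for the unified model: picks a tool and its argument."""
    if "count" in request:
        return "word_count", context
    return "upper", request


def run_turn(request: str, skill: str) -> str:
    context = SKILLS[skill]                        # read: request plus skill context
    tool_name, arg = stub_model(request, context)  # eval: model decides what to do
    result = TOOLS[tool_name](arg)                 # harness executes the chosen tool
    return f"{tool_name} -> {result}"              # print: report the result


outputs: List[str] = [
    run_turn("count the words in the skill", "slides"),
    run_turn("make a title", "slides"),
]
```

Keeping skills as data rather than weights is what lets a new domain be added by editing SKILLS instead of retraining the model in the middle.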

This is roughly the shape every agent product is converging on, but Luma wants the middle layer to be one giant multimodal brain rather than a federation of specialists with a judge on top.

Customers, data rooms, and why studios are joining

The customer list is loud. Netflix and Amazon Prime Studio at the same time (“arch nemeses,” Amit calls them). Coca-Cola moving $3 billion in annual content production to Luma. Publicis, the world’s largest ad agency. A $4.5-million-per-episode Prime Video production called Old Stories, starring Ben Kingsley as Moses, produced largely with Luma agents. A SciPlay Games meeting where Amit produced a 500-asset campaign live during the pitch.

The sensitive-data problem is real — a studio will happily let Luma train on their content for them, but not let another studio benefit. So Luma builds data isolation guarantees (SOC 2, plus AI-specific controls) and marks projects so they never enter training loops. What Luma does still learn from is the interaction trace — how creatives steer the tool — rather than the pixels themselves.
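A minimal sketch of that split is shown below, assuming a per-project exclusion flag and two record kinds. The field names and the exact rule (drop pixels from excluded projects, keep interaction traces) are one reading of the description above, not a documented Luma policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    project_id: str
    kind: str                    # "pixels" or "interaction_trace"
    exclude_from_training: bool  # hypothetical per-project flag


def training_eligible(records: List[Record]) -> List[Record]:
    """Keep interaction traces; drop pixels from projects marked as excluded."""
    return [
        r for r in records
        if r.kind == "interaction_trace" or not r.exclude_from_training
    ]


records = [
    Record("studio_a", "pixels", exclude_from_training=True),
    Record("studio_a", "interaction_trace", exclude_from_training=True),
    Record("open_demo", "pixels", exclude_from_training=False),
]
eligible = training_eligible(records)
```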

On Sora’s shutdown and the market

Asked why OpenAI shut down Sora, Amit’s answer is one word: focus.

“OpenAI at the core of it is a large language model lab. When you do everything, that’s really hard to do… Organizational physics still come into play. Less is more.”

He also pushes back on the assumption that Sora shutting down means the video market is small. Google is doubling down. Gemini handles all these modalities. The size of the market is fine; OpenAI just has to narrow its surface area before its IPO.

On GANs, diffusion, and what’s next

A student asks about GANs — the generative adversarial networks that ruled 2017-2018. Amit’s answer is honest. Luma still uses GANs for distillation and real-time generation. But GANs are finicky and unpredictable, researchers hate working on them, and they don’t scale the way transformers do. Important principle:

“What researchers want to work on is generally what will get worked on. I can make the case that Rust is more efficient — doesn’t matter. Everybody wants to code in Python, so that’s what will be done.”

Then a contrarian bomb: diffusion models are on the way out too. Luma is moving to hybrid autoregressive-plus-diffusion regimes because pure diffusion has “really really bad habits” that don’t unlearn, and the scaling physics don’t cooperate. The unified models are already this hybrid.

Hollywood is default dead

On creatives embracing AI, Amit is blunt: Hollywood has been dying for 30 years, and it’s not AI’s fault. COVID accelerated it, the writers’ strike sealed it. Production has fled LA — it happens in Greece, Canada, Ireland, anywhere with tax incentives. Hollywood finances movies, it doesn’t make them anymore.

Worse, Hollywood now thinks like private equity. A hit becomes a franchise becomes a rent-seeking machine. Guardians of the Galaxy 2, 3, 4, 5, the seventh Avengers, the crossover multiverse. As a physicist, Amit finds the multiverse-as-storytelling-device personally offensive. But the bigger point is that audiences don’t want this — they want variety, and Netflix’s 800 productions a year proves it.

“It’s not the audience’s job to come to the theaters to keep Hollywood alive. It’s Hollywood’s job to make great things so audiences want to watch it.”

AI is a chance to reset. When execution is cheap, you can parallelize exploration. Try ten ideas instead of gold-plating one. The best creatives throughout history — Einstein, Mozart, Archimedes — were prolific, not precious. Industrial pressure turned artists into one-shotters. AI lets them be prolific again. The mediocre get weeded out. The great get compounded — because the slide design skill one human writes once now runs trillions of times.

The final gap

The closing question: what separates today’s video models from being as generally useful as ChatGPT?

One word. Intelligence.

Current image and video models are beautiful pixel generators with no memory, no introspection, no grasp of physics, no ability to iterate. They’re like working with someone who forgets what you said two minutes ago. The unified model is designed to fix exactly that — multi-turn, context-aware, physically grounded. The RLHF moment for video, basically. Once it clicks, Amit argues, the applications aren’t just prettier ads. A history class where you can actually run counterfactuals — what if the Rubicon wasn’t crossed, what if Ferdinand wasn’t shot — and watch the alternate century unfold.

Key Takeaways

Differentiability is the whole game. Deep learning only works on functions you can run gradient descent through. Every modality that enters AI has to first be made differentiable — text already was, 3D became so with NeRF in 2020.
Design algorithms around where the data is, not the other way around. Luma abandoned 3D capture because phone-scanned 3D data will never outscale YouTube-scale 2D video. Robotics is stuck in this same trap now — no internet of action data.
A frontier lab has four pillars, not three. Data, compute, algorithm — plus skills/trainers/tutors. Without the human labeling and preference-shaping layer, the model can’t find the narrow slice of “useful” inside its raw distribution.
Current multimodal models are fax machines. Nano Banana–style systems use a language tower to write an enhanced prompt, then ship it through a thin 700-800M-parameter encoder to a separate image tower. The bridge is the bottleneck.
Unified architecture = one transformer, many encoders, one reasoning space. Mirrors how the human cortex works: specialized sensory areas feed into a single reasoning substrate.
The encoders and decoders are where transformer architectures actually break. The transformer backbone is agnostic to data type; the pre/post processing is where scaling pain lives.
Luma’s current training scale: ~30 petabytes of multimodal data, H100s today, GB300s soon, in the ~10K GPU range. Equivalent to second-tier language model training, not frontier LLM scale — yet.
Amit expects video-capable unified models to surpass pure language models in 2-3 years, because the multimodal data pool is strictly larger than text alone.
Product design can separate sensitive data from trainable data. Pixel outputs can be firewalled per customer; interaction traces can still be trained on because they’re not copyrightable visual IP.
Researcher preference is a law of physics in AI. Whatever researchers want to work on is what advances. GANs are still useful for real-time generation and distillation, but no one wants to work on them, so they stagnate.
Diffusion is on the way out. Luma is moving to hybrid autoregressive + diffusion regimes because pure diffusion has scaling pathologies that don’t unlearn.
“Slop” = what people call AI output when they haven’t seen a good one yet. Live demos change minds that blog posts can’t.
Execution used to be the scarce resource for creatives; now exploration is. You can validate ten ideas at once instead of betting everything on one.
Hollywood’s rot is pre-AI. PE-style franchise rent-seeking (the 20th Avengers) is what broke it. Netflix producing 800 mid-budget originals a year is the counter-model.
Skill layer matters more than the model. The 50-page in-house slide-design document is why Luma’s slides look good, not the raw model.

Claude’s Take

This is a good, substantive guest lecture. Jain is clearly building the thing he’s describing, which gives his takes weight the usual podcast AI takes don’t have. The core argument — that the fax-machine architecture between language and image towers is a ceiling, and that one unified transformer is the way past it — is plausible and he’s backing it with money. $1.5 billion raised total, $1 billion in the last year. If they’re wrong, they’re expensively wrong.

The claim I’d flag is the “better than Anthropic’s coding models on grid schematics” anecdote. That’s a narrow, cherry-picked domain, and Anthropic’s models weren’t trained on that customer’s proprietary grid diagrams. It’s a fair illustration of how much domain context matters, but it’s not the apples-to-apples comparison the phrasing suggests. Take it as a marketing line with a kernel of real insight.

The contrarian call that diffusion models are on the way out is worth watching. Most of the AI image/video world runs on diffusion today. If Luma’s hybrid autoregressive-plus-diffusion architecture really scales better, that’s a big architectural call that won’t be obvious until a few more generations of models ship. Worth bookmarking and checking back in 18 months.

The Hollywood-as-PE riff is the most quotable section and probably the most correct. It’s also not new — people have been saying this for a decade — but Jain ties it to AI’s role more cleanly than most: AI doesn’t kill Hollywood, it gives Hollywood a chance to stop being PE. Whether studios actually take that chance is a separate question, and the track record isn’t great.

Score: 8/10. Dense, opinionated, technically specific, with real business detail. Docked slightly for a few places where the founder-on-stage confidence outruns the evidence (the Anthropic comparison, the certainty about scaling to hundreds of billions of parameters, the claim that diffusion is on the way out).