CS153 '26: Frontier Systems - Mati Staniszewski, ElevenLabs
ELI5/TLDR
Two Polish guys got tired of watching foreign movies where every single character, men and women, kids and villains, is dubbed by the same monotone old man reading in a deliberately flat voice so you can “interpret the emotions yourself.” They quit their jobs at Google and Palantir, taught a computer to read text out loud like a real human actor, and four years later their company ElevenLabs does $430 million a year, restores voices for people who lost theirs to ALS, and lets an Argentinian president give his UN speech in English in his own voice. This talk is the founder explaining how they got here, how voice AI actually works under the hood, and why they still think the clunky-sounding approach is going to beat the fancy one for a while longer.
The Full Story
The Polish dubbing thing
Mati Staniszewski grew up in Poland, where foreign movies get a specific and quietly insane treatment. Instead of hiring different actors to voice different characters, Polish television uses something called a “lektor” — one single man reading every line, for every character, on top of the original audio. Men, women, children, goblins, Meryl Streep, Arnold Schwarzenegger. All one guy. And here’s the kicker: the lektors are actually trained to read in a flat, unemotional voice. The official theory is that this lets the audience “interpret the emotions themselves.” Mati, understated about it, notes that “any Polish person will account how like not good experience that is.”
This is the origin story. Not an elegant research insight. A national cinema experience so strange that two engineers leaving big tech thought: surely by 2022 a computer can do better than this.
Mati and his co-founder Piotr (Mati from Palantir, Piotr from Google) started ElevenLabs in London in 2022 with a simple-sounding goal that turned out to have three hard problems nested inside it. If you want to dub a movie automatically, you need three different AI models working in sequence.
First you need transcription. A model listens to the original audio and writes down who said what, stripping out background noise and figuring out which character is speaking. Then you need translation. The text goes into a language model that rewrites it in the target language. Then you need text-to-speech — the model reads the new text out loud, ideally keeping the original actor’s voice and emotional delivery. Three models, chained together, each one’s mistakes feeding into the next.
The class host Andrej (who was running platform at Discord when Mati found him) points out that this is what computer scientists call a cascaded pipeline. Think of it like a bucket brigade. Each model is a person in the line. Each one hands its output to the next one. If anyone fumbles, everyone downstream suffers. In 2022, with language translation still being mediocre, all three buckets were leaking.
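To make the bucket brigade concrete, here is a minimal Python sketch of a cascaded dubbing pipeline. The three stage functions are invented stubs for illustration, not ElevenLabs APIs; the point is only the shape: each stage consumes the previous stage's output, so a mistake in transcription flows straight into the translation and then into the final audio.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    text: str


def transcribe(audio_path: str) -> list[Utterance]:
    """Stage 1 (stub): speech-to-text plus figuring out who is speaking."""
    # A real system would run speech recognition and speaker diarization here.
    return [Utterance(speaker="narrator", text="Hello there.")]


def translate(lines: list[Utterance], target_lang: str) -> list[Utterance]:
    """Stage 2 (stub): rewrite each line in the target language."""
    # A real system would call a translation or language model here.
    return [Utterance(u.speaker, f"[{target_lang}] {u.text}") for u in lines]


def synthesize(lines: list[Utterance], voice_id: str) -> bytes:
    """Stage 3 (stub): read the new text aloud in the original actor's voice."""
    # A real system would call a text-to-speech model conditioned on the voice.
    return b"...generated audio..."


def dub(audio_path: str, target_lang: str, voice_id: str) -> bytes:
    # The bucket brigade: each model hands its output to the next,
    # so any fumble upstream is baked into everything downstream.
    transcript = transcribe(audio_path)
    translated = translate(transcript, target_lang)
    return synthesize(translated, voice_id)
```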
The first hard pivot
So Mati did something useful. He called creators and studios and asked what they actually wanted. Their answer was not “please dub my movie into twelve languages.” Their answer was more like: “Can you just fix the little mistakes in my voiceover without me having to re-record the whole thing?” Or: “Can you read my audiobook script in my own voice so I don’t have to spend three weeks in a booth?”
This is the moment the company pivoted. Instead of trying to fix all three models, they picked the hardest and most useful one — the final step, text-to-speech — and went deep. In 2022 that meant making computer speech sound actually human.
Two things were broken at the time. One: you couldn’t really clone a specific voice. The standard approach was to describe a voice with hand-set knobs for gender, age, and accent and have the model try to match them — imagine trying to describe your best friend’s voice using only a dropdown menu. Two: the computer didn’t understand context. It would read a happy sentence in the same monotone as a sad one, because it didn’t know it was supposed to be happy. It had no idea it was reading dialogue versus narration.
“If you are reading it, you know that it’s happy. You deliver that in a happy way. If it’s a dialogue sequence in the book, you know it’s a dialogue.”
The breakthrough borrowed from a different field entirely. Large language models had just started working. The core trick of a language model is that it predicts each word based on all the words that came before it — it has context. Mati and Piotr realized you could bring that same trick into speech generation. Let the model read the whole paragraph first, notice that it’s an angry argument or a tender goodbye, and then narrate accordingly. Meanwhile, instead of hard-coding voice parameters, they let the model invent its own abstract way of describing a voice — the way an artist might describe a color by mood rather than by RGB values.
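A rough sketch of that shift, again with made-up stubs rather than anything from ElevenLabs' actual models: the old approach reduces a voice to a handful of hand-set parameters, while the newer approach hands the model a learned voice vector plus the whole paragraph of context and lets it decide the delivery.

```python
import numpy as np

# Old style: a voice reduced to a dropdown menu of hand-set parameters.
legacy_voice = {"gender": "female", "age": "40s", "accent": "British"}


def embed_voice(reference_audio: bytes) -> np.ndarray:
    """Stub: a learned encoder maps reference audio to an abstract voice vector.
    The model picks its own dimensions; none of them is literally 'age' or 'accent'."""
    return np.zeros(256)


def speak(sentence: str, full_paragraph: str, voice: np.ndarray) -> bytes:
    """Stub: generate audio for one sentence, conditioned on the whole paragraph
    (so the model can tell an angry argument from a tender goodbye) and on the
    learned voice vector instead of the knobs above."""
    return b"...generated audio..."


paragraph = "\"Don't go,\" she whispered. He turned back at the door."
voice = embed_voice(b"...a minute of reference audio...")
audio = speak("Don't go,", full_paragraph=paragraph, voice=voice)
```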
The first models were tiny. A few hundred million parameters, trained on tens of thousands of dollars of compute. Mati remembers thinking $6,000 for a patent lawyer was absurd and deciding against it. (The patent, he notes later, was never missed. By the time anyone could have copied them, the tech was already obsolete.)
They got a lot of free compute from Nvidia’s startup program. Andrej cuts in dryly to note that that world is gone now — then corrects himself and tells the Stanford students they probably still have access to free credits, so use them.
The Javier Milei moment
Over the next three years the research broadened. 2022 was text-to-speech. 2023 was voice cloning and a marketplace where creators could license their own voices. Then in 2024 came the moment Andrej calls “the day everything changed.” One morning he woke up to six different people sending him a tweet. It was a video of Javier Milei, the president of Argentina, giving a speech at the UN. Except Milei was speaking English. In his own voice. With his own delivery. ElevenLabs had done it, stitching together transcription, translation, and voice cloning into the dubbing pipeline they’d originally set out to build.
Around the same time they worked with Lex Fridman on conversations with world leaders — Milei, Zelensky, Modi — letting each speak in a language they didn’t actually speak, in their own voice.
2025 was the year the whole pipeline got fast enough to run in real time, which meant you could have actual conversations with AI voice agents. This is where the company’s current bread and butter lives — replacing customer support calls, appointment booking, the stuff that used to require a human reading from a script.
Cascaded vs. fused: the architectural fork in the road
This is the part of the talk that rewards a little patience, because it explains a real tradeoff the whole field is wrestling with.
Right now, voice AI has two schools of thought. The cascaded approach is the three-buckets version. Separate models for transcription, reasoning, and speech, chained together. The fused approach throws all of that into one giant model that takes audio in and spits audio out, skipping the “text” step entirely. Think of it like the difference between a kitchen with three specialists — one for prep, one for cooking, one for plating — versus a single chef who does all of it in their head without ever writing anything down.
The fused approach sounds sexier. It’s faster, around 300 milliseconds of latency, because you’re not shipping data between three models. And it handles emotion more naturally, because nothing gets flattened into plain text in the middle.
But Mati argues the cascaded approach wins on the things enterprise customers actually care about. Here’s his reasoning, translated into human terms.
Imagine you’re calling an airline to rebook a flight. The AI needs to: authenticate you, pull up your account, check availability, process a payment, send you a two-factor code, update a database. Each of those is a tool call — the AI reaching out to another system. With a cascaded pipeline, each of those steps is visible and auditable. If something goes wrong, you can see exactly where. With a fused model, all of that tool-calling has to happen somewhere inside one giant opaque brain, and if the payment fails, good luck figuring out why.
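Here is a small Python sketch of what “visible and auditable” means in practice. The tool names are hypothetical, invented for this example; the point is that in a cascaded setup the reasoning model emits explicit tool calls, each of which can be logged and inspected, whereas a fused audio-to-audio model would have to do the equivalent work implicitly inside its weights.

```python
import json
from datetime import datetime, timezone


# Hypothetical airline tools -- stand-ins, not a real API.
def authenticate(customer_id: str) -> dict:
    return {"ok": True, "customer_id": customer_id}


def rebook_flight(customer_id: str, flight: str) -> dict:
    return {"ok": True, "flight": flight}


TOOLS = {"authenticate": authenticate, "rebook_flight": rebook_flight}
audit_log: list[dict] = []


def call_tool(name: str, **kwargs) -> dict:
    """Every step in the cascade is an explicit, logged tool call, so if the
    rebooking or payment fails you can see exactly which step failed and why."""
    result = TOOLS[name](**kwargs)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "tool": name,
        "args": kwargs,
        "result": result,
    })
    return result


# The text-based reasoning model decides which tools to call and in what order;
# in a fused model this whole sequence would be implicit and opaque.
call_tool("authenticate", customer_id="C123")
call_tool("rebook_flight", customer_id="C123", flight="LH 1622")
print(json.dumps(audit_log, indent=2))
```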
“Cascaded models work already really well on that given the intelligence layer will be fixing that. And the fused, you sacrifice that.”
So ElevenLabs is betting on cascaded for the boring-but-critical stuff — banking, support, healthcare — while keeping an eye on fused for the use cases where low latency matters more than reliability, like a chatty AI companion. Mati guesses the eventual answer is a hybrid: fused when you’re just chatting, cascaded the moment you actually need to do something real.
On the emotion problem specifically, they recently cracked it in the cascaded architecture by having the transcription model detect feelings (stressed, peppy, sad) and pass those as hints to the speech model. Getting there required building an entire human-labeled dataset of emotional speech — because you can’t train a model to recognize “stressed” unless someone has labeled a lot of stressed-sounding audio first. This is one of those unglamorous bottlenecks that nobody talks about at conferences: progress was blocked for months not by compute or architecture but by needing actual humans to sit down and tag data.
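In code, the workaround has roughly this shape (stub functions, not the actual ElevenLabs interface): the transcription stage attaches an emotion tag to what it heard, and that tag rides along through the cascade as a delivery hint for the speech stage instead of being flattened away with the text.

```python
from dataclasses import dataclass

EMOTIONS = {"neutral", "stressed", "peppy", "sad"}


@dataclass
class TaggedUtterance:
    text: str
    emotion: str  # one of EMOTIONS, predicted by the transcription model


def transcribe_with_emotion(audio: bytes) -> TaggedUtterance:
    """Stub: speech recognition that also classifies how the speaker sounds.
    Training this classifier is the unglamorous part -- it needs a
    human-labeled dataset of emotional speech."""
    return TaggedUtterance(text="My flight got cancelled again.", emotion="stressed")


def synthesize(text: str, emotion: str = "neutral") -> bytes:
    """Stub: text-to-speech that takes the emotion tag as a delivery hint."""
    assert emotion in EMOTIONS
    return b"...generated audio..."


# The hint survives the trip through plain text instead of being lost with it.
heard = transcribe_with_emotion(b"...caller audio...")
reply = synthesize("I'm sorry about that, let me fix it.", emotion=heard.emotion)
```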
The business, which is absurd
ElevenLabs finished 2025 at $330 million in annual recurring revenue. By the time of this talk — early 2026 — they’d added another $100 million in a single quarter. $430 million+ in three and a half years from founding, with about 450 people. Andrej, who is usually composed, asks if they can pause for a second and acknowledge this.
The team is structured in small cells of under ten people, each with real decision-making authority. Mati is emphatic about this: the speed of learning matters more than process. The bases are London, New York, Warsaw, and San Francisco, with more people in Europe than most American AI companies.
On pricing, Mati offers one line that’s worth underlining for anyone building anything:
“Never start from the cost, start from the value and work backwards from there. If you deliver the value, you roughly want to capture one-tenth of what you delivered.”
Translation: don’t price by how much it costs you to run the model. Price by how much the customer got. If your AI saves a customer support team $10 million a year, charge them a million. The rest is marketing.
The parts that aren’t about money
Two moments in this talk land harder than the revenue numbers.
First: ElevenLabs has restored voices for around 10,000 people who lost theirs to ALS or throat cancer. Give the model enough recordings from before the illness — home videos, voicemails, whatever exists — and it can reconstruct a synthetic version faithful enough that the person can talk with their own voice through a phone or tablet. Mati mentions this almost in passing. It’s the proudest work they do.
Second: they worked with the Ukrainian government on an app called Diia, a single citizen portal that handles passports, benefits, education, and government services. When the war displaced millions of Ukrainians, a lot of people couldn’t reach local offices anymore. ElevenLabs added voice access so people without smartphones could call a phone number and get the same services. Mati visited Kyiv to work on it. He notes that every Ukrainian ministry had its own technical team able to move fast — no red tape, no consultants — and suggests this might be what modern government should look like.
He also mentions, with the same flatness, that the company has decided to be “Western-allied” as a matter of policy. They actively fight distillation attacks (where a foreign lab trains a cheaper knockoff by querying your model a million times and learning from the outputs). And they think voice authentication for banking is now a dead idea — if a model can clone a voice from a minute of audio, it shouldn’t be guarding your money.
One genuinely fun detail
A charity working with ElevenLabs built a system that detects likely phone scammers based on the IP address of the incoming call. When it spots one, instead of sending the call to a human, it routes the scammer to an AI voice agent whose only job is to waste their time. Mati says the transcripts are some of the funniest conversations they’ve seen. The scammers never figure it out.
Claude’s Take
This is a high-quality founder talk with minimal self-mythology, which is rarer than it should be. Mati comes across as exactly what you’d expect from an ex-Palantir engineer who grew up in Poland: precise, unshowy, willing to credit competitors, careful about what he actually knows versus what he’s guessing.
The strongest parts are the technical explanations of cascaded versus fused architectures. This is a real tradeoff being argued about in the field right now, and his framing — reliability and auditability for enterprise, speed and emotion for casual use, hybrid in between — is the consensus position among people who actually ship voice products, not the hype position. He’s not saying cascaded is better forever. He’s saying it’s better for the next few years for the specific kind of customer who pays real money. That’s a defensible claim, and he’s putting his own company’s bet on it.
The part to be slightly more skeptical of is the revenue predictability story. $330M to $430M in a quarter is extraordinary, and the explanation — “we have forward-deployed engineers who work alongside enterprise customers” — is roughly what every AI company says right now. It’s probably true that the deployment model scales with headcount in a reasonably linear way. But scaling revenue 30% in a quarter is not a repeatable engine so much as a moment in a market that’s still expanding faster than any individual company can keep up with. When growth slows (and it will), we’ll learn how much of this was durable customer value versus everyone in the enterprise scrambling to buy voice AI before their competitors do. Probably both, in some unknowable ratio.
The claim that emotion is “largely fixable in both approaches” is the bit I’d watch most carefully. Today’s cascaded emotion detection — transcribe audio, tag it with a feeling, pass that as a parameter — is a clever workaround, but it’s fundamentally different from a model that natively experiences and reproduces prosody. Whether those two approaches converge to the same quality or whether fused models pull decisively ahead is genuinely unknown. Mati is betting on convergence. Reasonable people disagree.
The collaboration-over-competition framing is sincere but worth noting as a luxury belief that becomes much easier when you’re already the category leader. It’s true that ElevenLabs and Sesame helping each other is better for the field. It’s also true that ElevenLabs, at $430M ARR and growing, can afford to be generous in a way a struggling competitor couldn’t. Generosity and market dominance are often complements, not contradictions.
The ALS voice restoration work and the Ukraine government deployment are real and verifiable and do not require any skepticism at all. They are unambiguously good uses of this technology, and the fact that ElevenLabs leads with them rather than buries them says something real about the company’s priorities.
One genuinely novel observation in the talk, worth stealing: the idea that in AI-assisted creative work, the right model is middle-to-middle, not end-to-end. You bring your story, the AI helps you generate and iterate, you refine the output. The end-to-end version — type a prompt, get a finished movie — is where AI slop lives. The middle-to-middle version is where actual craft lives. That distinction is going to matter more and more as these tools get better, and I haven’t heard it phrased that cleanly anywhere else.