FFmpeg: The Incredible Technology Behind Video on the Internet | Lex Fridman Podcast #496

ELI5 / TLDR

Almost every video you watch online — YouTube, Netflix, a clip on your phone — passes through a piece of free software called FFmpeg, and its sibling, the VLC media player (the one with the orange traffic cone). Both are built by a tiny handful of unpaid volunteers, yet they run on billions of devices. This is a four-hour conversation with two of the people behind them about how video actually gets squished down small enough to send over the internet, why a few people hand-writing the most low-level code imaginable still beats giant corporations, and why the guy who built VLC turned down tens of millions of dollars to keep it free and ad-free.

The Full Story

What happens when you press play

Start with a simple mystery. You double-click a video file and a few hundredths of a second later, moving pictures and sound appear. What just happened?

The file is a sealed box. The first job is to open the box and find what’s inside — that box is called a container (the part of the filename you know: .mp4, .mkv, .mov). Inside the container are separate streams: a video track, an audio track, maybe subtitles. Prying them apart is called demuxing — think of it like a postal sorting office splitting one mixed mailbag into separate piles.
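A minimal sketch of the sorting-office idea, assuming a container can be modeled as a list of interleaved packets that each carry a stream index. The Packet type and the sample payloads are hypothetical stand-ins; real containers also carry timestamps, headers, and codec parameters.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet: which stream it belongs to, plus its payload."""
    stream_index: int
    payload: bytes


def demux(packets):
    """Split one interleaved packet sequence into per-stream lists."""
    streams = defaultdict(list)
    for packet in packets:
        streams[packet.stream_index].append(packet.payload)
    return dict(streams)


# Toy "container": video (0), audio (1) and subtitle (2) packets interleaved.
container = [
    Packet(0, b"video-0"), Packet(1, b"audio-0"),
    Packet(0, b"video-1"), Packet(2, b"subs-0"),
    Packet(1, b"audio-1"),
]
streams = demux(container)
```

Each pile in streams keeps its packets in arrival order, which is what a decoder for that stream would then consume.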

Each pile is still compressed. To turn it back into something you can see or hear, you run it through a codec — the word is just “coder/decoder” smashed together, the thing that compressed it and the thing that unpacks it. Only after the codec has done its work do you get raw images and raw sound, which then go to your graphics card and your speakers.

One thing the guests hammer home: the filename lies. A file called .mp4 might actually be something else entirely. Most players, when they hit that mismatch, just break. VLC’s whole personality is that it doesn’t trust the filename; it looks inside and figures out what the file really is. That habit comes from its origin streaming video over a network, where packets arrive damaged and you have to play whatever you can salvage. As one of them put it:

“Everything in VLC is prepared to work with broken files. And it’s a philosophical idea from the beginning.”
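To make "looking inside" concrete, here is a rough sketch that guesses a container from a file's first bytes rather than its extension. The signatures checked are the well-known ones for Matroska/WebM, the MP4/MOV family, AVI, and Ogg. The function name and the short format list are illustrative assumptions, and this is not how VLC's own probing is implemented.

```python
def sniff_container(header: bytes) -> str:
    """Guess a container from its leading bytes, ignoring the file extension."""
    if header[:4] == b"\x1a\x45\xdf\xa3":        # EBML magic: Matroska / WebM
        return "matroska"
    if header[4:8] == b"ftyp":                   # ISO base media box: MP4 / MOV
        return "mp4"
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "avi"
    if header[:4] == b"OggS":                    # Ogg page capture pattern
        return "ogg"
    return "unknown"


# A Matroska file that someone renamed to "holiday.mp4" still sniffs as Matroska.
mislabeled = b"\x1a\x45\xdf\xa3" + bytes(8)
guess = sniff_container(mislabeled)
```

A real player would go further and tolerate damaged or truncated headers, which is the part this sketch leaves out.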

Why compressing video is hard in a way zipping a file is not

Here is the key idea, and it is genuinely surprising. When you zip a document, nothing is lost — data goes in, the exact same data comes out. Video compression is not like that. It throws information away. On purpose. A lot of it.
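The difference can be seen in a few lines. The sketch below uses Python's zlib for the lossless case and a deliberately crude stand-in for the lossy case (keeping only the top 4 bits of each sample). The sample values are made up, and real video codecs discard information in far more sophisticated ways.

```python
import zlib

document = b"the same bytes go in and come out " * 100

# Lossless: decompressing returns exactly the original bytes.
packed = zlib.compress(document)
restored = zlib.decompress(packed)

# Lossy (toy): keep only the top 4 bits of each sample, discarding fine detail.
samples = [200, 201, 203, 198, 55, 54, 57]
quantized = [s >> 4 for s in samples]          # what gets stored
approximated = [q << 4 for q in quantized]     # the best the decoder can rebuild
```

Here restored equals document byte for byte, while approximated is only close to samples: the small differences between neighbouring values are gone for good.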

“People don’t realize how compressed we do… when you move to video, you need one hundred times, two hundred times.”

How do you throw away 99% of the data and have it still look fine? The trick is that the only audience is the human eye, so you only need to keep what the eye notices. Imagine you have a camera panning across a sky. The clouds in the background barely change for thirty seconds — so why store them thirty times? Store them once and say “same as before” for every frame after. That is temporal compression (across time). Then within a single frame, a black wall is the same black in thousands of pixels — so don’t list every pixel, just say “this whole region is the same.” That is spatial compression (within one image).
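Both ideas can be illustrated with toys. The sketch below assumes a frame is just a flat list of pixel values: run-length encoding stands in for spatial compression, and a "what changed since the last frame" dictionary stands in for temporal compression. The function names and sample frames are hypothetical, and real codecs use block transforms and motion-compensated prediction rather than these simple schemes.

```python
def run_length_encode(pixels):
    """Spatial toy: collapse runs of identical pixels into [value, count]."""
    runs = []
    for value in pixels:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def frame_delta(previous, current):
    """Temporal toy: record only the positions whose value changed."""
    return {i: new
            for i, (old, new) in enumerate(zip(previous, current))
            if old != new}


def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    return [delta.get(i, value) for i, value in enumerate(previous)]


frame_1 = [0, 0, 0, 0, 9, 9, 0, 0]
frame_2 = [0, 0, 0, 0, 0, 9, 9, 0]   # the bright patch moved one pixel right

runs = run_length_encode(frame_1)
delta = frame_delta(frame_1, frame_2)
rebuilt = apply_delta(frame_1, delta)
```

Eight pixels shrink to three runs, and the second frame is described by just two changed positions.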

They go further. Your eye is much better at noticing brightness than color, so video doesn’t even store pictures the way you’d expect (red-green-blue). It splits each image into brightness on one side and color on the other, then quietly halves the resolution of the color, and most people never see the difference. That one move alone shrinks the raw data by a third if the color is halved in one direction, and by half if it is halved in both (the common 4:2:0 layout), before any real compression even starts.
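The saving from shrinking the color planes is simple arithmetic. The sketch below assumes 8-bit samples and a 1920x1080 frame (both illustrative choices) and counts raw bytes for three layouts: no subsampling, color halved horizontally only, and color halved in both directions.

```python
def raw_frame_bytes(width, height, chroma_h_div, chroma_v_div):
    """Bytes for one 8-bit frame: full-size brightness plane plus two color planes."""
    luma = width * height
    chroma_plane = (width // chroma_h_div) * (height // chroma_v_div)
    return luma + 2 * chroma_plane


w, h = 1920, 1080
full = raw_frame_bytes(w, h, 1, 1)         # 4:4:4, no subsampling
half_h = raw_frame_bytes(w, h, 2, 1)       # 4:2:2, color halved horizontally
half_both = raw_frame_bytes(w, h, 2, 2)    # 4:2:0, color halved both ways
```

Halving in one direction leaves two thirds of the raw data; halving in both leaves exactly half.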

There’s a beautiful structure underneath this, worth slowing down for. Frames come in three flavors:

An I-frame is a complete picture, like a single JPEG. You can start watching from one.
A P-frame is a predicted frame — it doesn’t store a full picture, just the instruction “take blocks 5, 7, and 42 from the previous frame, and here’s the little bit that changed.”
A B-frame is the mind-bending one. It can depend on a frame in the future. To decode it, the player has to peek ahead, decode a later frame first, then come back. The order frames are decoded in is not the order you watch them.

“B-frames can depend on frames that are coming in the future… the decoding order is not the same as the display order.”
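A small sketch of why the two orders differ, assuming a hypothetical four-frame group where the two B-frames reference both the I-frame before them and the P-frame after them. In real streams the encoder already writes frames in decoding order, so a player does not compute this; the code only shows that "references first" forces a reshuffle.

```python
# Each frame: (display_number, frame_type, display_numbers_it_references)
frames = [
    (0, "I", []),
    (1, "B", [0, 3]),
    (2, "B", [0, 3]),
    (3, "P", [0]),
]


def decode_order(frames):
    """Return display numbers in an order where every reference is decoded first."""
    decoded, order = set(), []
    pending = list(frames)
    while pending:
        for frame in pending:
            number, _, refs = frame
            if all(r in decoded for r in refs):
                decoded.add(number)
                order.append(number)
                pending.remove(frame)
                break
        else:
            raise ValueError("a frame references something that never decodes")
    return order


order = decode_order(frames)
```

The frames are shown as 0, 1, 2, 3 but have to be decoded as 0, 3, 1, 2: the future P-frame comes before the B-frames that lean on it.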

That all of this happens flawlessly, billions of times a day, on machines from different manufacturers that all produce bit-for-bit identical results, is, as they keep saying, quietly miraculous. (That guarantee, called bit exactness, was a hard-won lesson; the 1990s MPEG-2 standard didn’t require it, and they call that one of the industry’s big mistakes.)

FFmpeg and VLC: a binary star system

So where do these two famous projects sit? FFmpeg is the engine room: the low-level library of codecs, muxers, and filters. It’s inside everything: Chrome, smart TVs, OBS, and yes, VLC. There’s also a command-line tool, also called ffmpeg, so powerful that people describe it as a programming language in its own right; you can do After Effects-style video editing with a single (sometimes thousand-character) line of text. Many people now use AI just to generate FFmpeg commands, because the option list is overwhelming.
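For a sense of what even a short command looks like, here is a sketch that builds (but does not run) a basic transcode to H.264 video and AAC audio. The file names are hypothetical, the helper function is not part of FFmpeg, and the command assumes an ffmpeg build that includes the libx264 encoder.

```python
import shlex


def transcode_command(source, target, crf=23):
    """Build an ffmpeg argument list for a simple H.264/AAC transcode."""
    return [
        "ffmpeg",
        "-i", source,          # input file
        "-c:v", "libx264",     # video encoder: x264
        "-crf", str(crf),      # quality target; lower means higher quality
        "-c:a", "aac",         # audio encoder
        target,
    ]


command = transcode_command("input.mov", "output.mp4")
printable = shlex.join(command)   # the same command as one shell-ready string
```

Passing command to subprocess.run would execute it; the filter-heavy one-liners people describe grow from this same shape by adding more options between the input and the output.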

VLC is the player most people actually see: the orange cone. The two are not competitors and not the same thing. The analogy they use: VLC is to FFmpeg as Android is to Linux. They depend on each other and grew up together, a “binary star system.” A common online jab is “VLC is just FFmpeg inside doing the real work,” and both guests push back firmly: they feed each other. (VLC also leans on another VideoLAN project, x264, the open-source encoder behind a huge share of the H.264 video on the internet.)

What’s philosophically lovely, one of the speakers notes: your grandmother’s home video and a trillion-dollar corporation run on the exact same technology stack. FFmpeg democratized something that in the 1990s required equipment the size of a car costing hundreds of thousands of dollars. The podcast and YouTube revolutions happened partly because this became free.

The open-source story, and the cheesecake

When they explain what “open source” means, JB reaches for food. A normal company sells you the finished cheesecake. Open source gives you the cheesecake, the recipe, instructions for building the oven, and permission to change the recipe and sell your own version. Software, after all, is just a very long recipe — a normal program is tens of billions of tiny instructions instead of the dozen steps in grandma’s recipe.

The license is the glue. It’s a kind of social contract — the one thing thousands of contributors, from all over the world, across every political and religious line, actually agree on. And it has teeth: when JB wanted to relicense VLC’s core to be slightly more permissive (so it could be embedded in commercial apps and pass Apple’s App Store rules), he legally needed the agreement of every contributor — because each person keeps the copyright on their own lines forever. He had to track down 350+ people. One story stays with you: he traveled to a factory to get a signature from a father whose son had written the code and had since died, and had to gently explain what this stranger was even asking for. He was 21, nearly in tears.

“We are talking about lives of people… So it’s important to do it right.”

Excellence, harshness, and teenagers writing assembly

The recurring theme: a tiny core (5 people for VLC, 10–15 for FFmpeg) maintains code that a thousand people pass through. Since only about 1% of contributors ever stick around, the maintainers will be the ones living with your code for years. So the bar is brutal: is the code excellent? Nothing else matters — not your job title, not where you’re from.

“Maybe you’re a dog. I don’t care… I need to look at your code.”

This is why the tone online can be harsh (Linus Torvalds is the patron saint of this; they note he built Git in two weeks, and argue that it may be more world-changing than Linux itself). But the flip side is radically meritocratic: teenagers have written thousands of lines of the hardest code in these projects. One contributor was 14. “Teenagers have written more assembly in FFmpeg than Google engineers.”

The assembly heresy

This is the spiciest and most counterintuitive thread, and worth unpacking carefully.

Most programmers write in a high-level language (C, Python) and trust the compiler — a translator program — to turn it into the actual instructions the chip runs. The orthodox belief in software is: compilers are smart and only getting smarter, so hand-writing the low-level stuff yourself is a waste of time; you can never beat the machine.

The FFmpeg crowd says: not even close. They hand-write assembly — code written directly in the processor’s own instructions, no translator in between — and routinely get speedups of 10x, 50x, even 62x over what the compiler produces. Not 5% faster. Multiple times faster. They’ve spent two years posting hundreds of examples on Twitter, and software engineers keep insisting it’s impossible.

Why does it matter so much? Because of scale. A video decoder like dav1d (their decoder for the modern AV1 format) runs on roughly 3 billion devices, decoding nonstop — 30% of Netflix and 50% of YouTube is now AV1. At that scale, every single CPU cycle saved is multiplied a billionfold. dav1d is 30,000 lines of C and 240,000 lines of hand-written assembly — likely one of the largest assembly codebases anywhere.

The flavor of assembly they use is called SIMD — “single instruction, multiple data.” Normally, to add 5 to a number, you do one addition. With SIMD, one instruction adds 5 to sixteen numbers at once. Video is a grid of pixels, so doing the same operation to many pixels simultaneously is a perfect fit. They go even further — abusing the machine, using a cryptography instruction to do something completely unrelated to cryptography, and inventing their own rules for how functions hand data to each other (the “calling convention”) to shave off microseconds.
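Python cannot issue real SIMD instructions, but the core idea can be imitated with a classic trick sometimes called "SIMD within a register": pack eight 8-bit pixels into one 64-bit integer and add to all of them with a single integer addition. This is only an illustration of the concept under that assumption, not the hand-written assembly the projects use. Hardware SIMD works on wider registers and typically also offers saturating variants, whereas this toy wraps around at 255.

```python
LANES = 8                                           # eight 8-bit lanes per word
LOW7 = int.from_bytes(b"\x7f" * LANES, "little")    # low 7 bits of every lane
HIGH = int.from_bytes(b"\x80" * LANES, "little")    # top bit of every lane


def pack(values):
    """Pack eight 0..255 values into one integer, one byte per lane."""
    return int.from_bytes(bytes(values), "little")


def unpack(word):
    """Split the packed integer back into its eight byte lanes."""
    return list(word.to_bytes(LANES, "little"))


def add_lanes(a, b):
    """Add all eight lanes with one integer addition (wrapping per lane)."""
    # Add the low 7 bits of every lane, then repair each top bit with XOR so
    # a carry never spills into the neighbouring lane.
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH)


pixels = [10, 20, 250, 0, 1, 2, 3, 255]
brighter = unpack(add_lanes(pack(pixels), pack([5] * LANES)))
```

One call to add_lanes brightens all eight pixels at once; the last lane wraps from 255 to 4, which is exactly the kind of edge case a saturating hardware instruction would handle differently.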

The deeper point JB makes connects to this exact moment in computing: Moore’s Law is ending. Hardware isn’t getting dramatically faster anymore — we just bolt on more cores, and that has limits. The demand for compute (especially for AI) is exploding while the chips aren’t keeping pace. So the only way forward is to descend the stack and squeeze every drop out of the hardware you already have. He draws a direct line to AI: the same way people quantize large language models down to 4-bit to fit them in memory, the future will require this kind of low-level optimization. And it’s the one thing, he argues, that can’t be vibe-coded.

“The core thing that you will not be able to vibe code are optimization for the hardware to be as fast as is possible.”

Reverse engineering: archaeology with a debugger

Some of the most revered people in the community do reverse engineering — figuring out a secret, undocumented codec with no source code, just a binary blob (sometimes 20–30 megabytes; one megabyte is roughly a month of work). The legendary example is a Ukrainian engineer, Kostya, who cracked obscure formats like GoToMeeting’s recording codec for fun, and left jokes in the code.

The process sounds insane when described: you load the program into a tool that shows raw machine instructions, you pause the program one instruction at a time, watch what each one does to memory, and slowly infer the logic — like an archaeologist with a tiny brush reconstructing a civilization from fragments. For a long stretch you’re “debugging purely in memory,” seeing nothing, not knowing if you’re even close. Why bother? Because in 15 years that GoToMeeting.exe won’t run on your phone or your future computer — but the recordings still exist, and someone will want to watch them.

Saying no — to money, to backdoors

Two principled refusals anchor the episode.

First, the famous one: JB turned down tens of millions of dollars (a Reddit meme by now) to keep VLC free and ad-free. The offers were always shady — bundled toolbars, spyware, ad injection — never something clean like “put Netflix inside VLC.” His logic was simple: he wanted to sleep at night, and selling out would betray everyone who’d contributed. “I want to be proud of what I’ve been doing.”

Second, intelligence agencies asked — twice — to put a backdoor in VLC. The answer was an emphatic no (“if we had to compromise our software, we would shut it down”). They compile VLC with near-paranoid security: on machines that have never touched the internet, starting by compiling the compiler itself. And because it’s open source and international — contributors in the UK, Germany, the US would all see any sneaky change — secretly slipping in surveillance code is nearly impossible. VLC has no telemetry, no servers, no idea what you watch. When police once came with a murder case, all VideoLAN could do was help them play a corrupted file — they never see the content.

The maintainer-burnout problem, and a Google fight

A recurring worry: the whole internet rests on a few exhausted volunteers (the classic xkcd image of all modern infrastructure balanced on one tiny unmaintained project). The episode covers a recent public fight with Google’s security team, who used AI to mass-generate scary “critical vulnerability” reports on a 1990s game codec used on a single disc, and went to the press to brag about their AI before the volunteers could even respond. The guests’ framing: it’s like having infinite resources to pick the lock on someone’s hobby-project shed, then loudly declaring the lock unsafe, while contributing neither money nor fixes. The good news: the spicy public pushback worked. Google started sending patches and donations rose (though still not enough for even one full-time developer). The grim adjacent example is the XZ fiasco, where a single burned-out maintainer was socially engineered into handing commit access to an attacker.

Where it’s all going

They close on the future. FFmpeg already runs on the Mars rover (“a multi-planetary open source library”) and is used by CERN, SpaceX, and Formula 1 teams. The bigger idea: “multimedia” just means a digital representation of streams for the human senses. So as new senses get digitized, FFmpeg will absorb them — point clouds and volumetric video for robots, haptics for theme-park rides (already a VLC plugin), and eventually, only half-jokingly, codecs for brain-computer interfaces. “You’ll have FFmpeg -i input format human brain.” “Stereo smell.” The job of the tiny core team isn’t to write every new format — it’s to keep the architecture clean enough that the world’s contributors can plug the future in.

Key Takeaways

Over 90% of online video processing touches FFmpeg; VLC has been downloaded 6.5+ billion times. Each is maintained by a tiny core of unpaid volunteers (about 5 for VLC, 10–15 for FFmpeg).
Video compression is lossy — unlike a ZIP, it permanently throws away data. It’s tuned entirely to human perception (the eye notices brightness more than color), achieving 100x–1000x compression.
A container (.mp4, .mkv) is the box; a codec (H.264, AV1) is the compressor inside it. The filename routinely lies about its contents — VLC inspects the actual bytes rather than trusting the extension.
Three frame types: I-frames are complete images; P-frames copy-and-tweak the previous frame; B-frames can depend on future frames, so decode order ≠ display order.
Bit exactness: modern codecs require every decoder, on any hardware, to produce identical output bit-for-bit. MPEG-2 in the ’90s didn’t, which they call a historic mistake.
FFmpeg and VLC are a “binary star system” — interdependent, not competitors. x264 (open-source H.264 encoder) is the third key VideoLAN project.
Hand-written assembly (specifically SIMD — one instruction operating on many pixels at once) beats compiler-generated code by 10x–62x. dav1d, the AV1 decoder, is 240,000 lines of it.
The economic logic: encoding is expensive and done once; decoding happens billions of times, so every CPU cycle in the decoder is worth optimizing obsessively.
Moore’s Law is ending — hardware isn’t getting much faster, so low-level optimization is becoming more valuable, including for AI. It’s the one thing that can’t be “vibe-coded.”
Reverse engineering a closed codec means stepping through a multi-megabyte binary one machine instruction at a time, inferring logic from sample files — done to keep old formats playable far into the future.
Each new codec generation (H.264→H.265→H.266, or AV1→AV2) gives roughly 30% better compression for the same quality, at the cost of dramatically more encoding compute.
The “AV” codecs (AV1, AV2) exist to dodge a patent minefield — H.265/HEVC licensing got so expensive (potentially hundreds of millions/year for YouTube or Netflix) that the big companies built their own royalty-free alternative.
Relicensing an open-source project requires consent from every contributor, because each keeps copyright on their own lines forever — JB once tracked down 350+ people.
The community is radically meritocratic: only code quality matters, and teenagers have written some of the hardest assembly in FFmpeg.
JB turned down tens of millions to keep VLC ad-free; the project refused two intelligence-agency requests for backdoors. VLC has zero telemetry and never sees what you watch.
An archiving community (lossless FFV1 codec) treats FFmpeg as a “Rosetta Stone” so the 20th/21st century’s video can still be played 1,000 years from now — they argue C will survive like Latin.
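The "roughly 30% per generation" takeaway compounds, which is easy to check. The sketch below reads "30% better compression" as "30% fewer bits for the same quality" (an assumption about what the figure means) and treats every generation as delivering exactly that saving, which real codecs only approximate.

```python
def relative_bitrate(generations, saving_per_generation=0.30):
    """Bitrate relative to the starting codec after N generational improvements."""
    return (1.0 - saving_per_generation) ** generations


one_step = relative_bitrate(1)    # e.g. H.264 to H.265
two_steps = relative_bitrate(2)   # e.g. H.264 to H.266
```

Under those assumptions, two generations roughly halve the bitrate: about 49% of the original for the same quality.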

Claude’s Take

This is a strong episode, and the score reflects substance over polish. The two guests genuinely know their domain at the deepest level, and Fridman mostly stays out of the way and lets them talk — which is the right call when the material is this rich. The technical explanations of compression, frame types, and assembly are clear enough that a non-engineer can follow them, and the human stories (the factory-worker signature, the death threat over dropping PowerPC support, refusing backdoors) give it real weight.

The BS filter, honestly applied: there’s some self-congratulation. The “assembly is everything” thread is real and important, but it’s also their identity and they ride it hard — for the vast majority of software, the compiler-is-fine crowd is correct, and the cases where 62x matters are genuinely special (billions of devices, real-time constraints). They acknowledge this, but a casual listener could walk away thinking everyone should hand-write assembly, which would be a mistake. Similarly, the Google security fight is presented mostly from their side; the security researchers’ position (that finding real bugs in ubiquitous infrastructure is a contribution) gets a fairer hearing than the FFmpeg Twitter account usually gives it, but it’s still their home turf. And at four hours, it meanders — the macro-economic and Europe-vs-entrepreneurship tangents add little.

What earns the 8 rather than something lower: the core material is durable and the worldview is coherent and admirable. The argument that low-level craft is becoming more valuable as Moore’s Law dies and AI’s hunger grows is the most quietly important idea here, and it’s not hype — it’s a real bet about where the next decade of computing constraints bite. The archiving-as-Rosetta-Stone framing alone is worth the price of admission. If you’ve ever pressed play and not thought about what happened next, this rewires that.