
YouTube

An Interview with Josh Fisher | Inventing VLIW, Multiflow, Itanium, VLIW's Massive Success

Asianometry published 2026-05-01 added 2026-05-05 score 9/10
computer-architecture vliw chips history compilers embedded itanium interviews

ELI5/TLDR

Josh Fisher invented an idea for how to build chips called VLIW — Very Long Instruction Word. The pitch: instead of making the chip a clever in-the-moment juggler that figures out at runtime which little operations it can do in parallel, hand the chip a single fat instruction that already says “do all of these eight things together.” The cleverness moves out of the silicon and into the compiler. Fisher’s startup Multiflow built the first practical VLIW computer in the 1980s, then died for ordinary business reasons. Intel and HP later bet the farm on VLIW for the desktop with Itanium, and that flopped famously. The video’s title is half a joke and half a fact: VLIW lost the desktop war but won everywhere else. Roughly 12 to 15 billion VLIW processors now ship every year — inside your phone’s Bluetooth chip, the cable box, your car, Google’s TPUs, and most of the AI accelerators training the models you use.

The Full Story

What VLIW actually is, before the war stories

A normal modern CPU is a juggler. You hand it a stream of small instructions — add this, multiply that, load this number from memory — and the chip itself, at runtime, looks ahead a few dozen instructions and tries to figure out which ones don’t depend on each other and can therefore run at the same time on different little execution units inside the chip. That’s called a superscalar processor. It is the standard way an x86 or an Apple chip works. A huge chunk of the silicon in such a chip is not doing math; it’s the supervisor figuring out what to send where, what to do if a guess turns out wrong, what to do when memory is slow. Clever, expensive, hot, and unpredictable.

VLIW does it the other way. The compiler — the program that turns your source code into machine instructions — looks at the program ahead of time, finds the pieces that can run together, and bundles them into one giant instruction. The chip then has nothing to figure out. It just reads the bundle and executes everything in it in lockstep. No supervisor. No guessing. No expensive bookkeeping silicon. Less heat, less area, perfectly predictable timing. The price: the compiler now has to be very smart, and the program has to be recompiled if the chip changes shape.

That’s the whole tradeoff. A superscalar puts the brains in the hardware. A VLIW puts them in the software.
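To make the "brains in the compiler" idea concrete, here is a minimal sketch of the packing step, assuming a toy machine where every operation takes one cycle and any slot can run any operation. The `Op` type, the `schedule` function, and the sample program are hypothetical names for illustration; real VLIW compilers use far more elaborate schedulers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str            # the value this operation produces
    deps: tuple = ()     # names of values it must wait for


def schedule(ops, width):
    """Greedy list scheduling: pack ops into bundles of at most `width`.

    An op may go into a bundle only if everything it depends on was
    produced by an earlier bundle. Assumes every op takes one cycle.
    """
    done = set()          # values produced by earlier bundles
    pending = list(ops)   # ops not yet scheduled, in program order
    bundles = []
    while pending:
        ready = [op for op in pending if all(d in done for d in op.deps)]
        bundle = ready[:width]
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependency")
        bundles.append([op.name for op in bundle])
        done.update(op.name for op in bundle)
        pending = [op for op in pending if op.name not in done]
    return bundles


# Hypothetical program: (a + b) * (c + d), with the four loads made explicit.
program = [
    Op("a"), Op("b"), Op("c"), Op("d"),
    Op("s1", ("a", "b")),
    Op("s2", ("c", "d")),
    Op("p", ("s1", "s2")),
]

wide = schedule(program, width=4)    # three bundles: loads, adds, multiply
narrow = schedule(program, width=1)  # seven bundles, one op each
```

On the 4-wide toy machine the seven operations collapse into three bundles; on a 1-wide machine they take seven. All of that analysis happens before the program runs, which is the point of the tradeoff.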

Where the idea came from

Fisher was a graduate student at NYU’s Courant Institute in the late 1970s, which happened to be one of the world’s best places for compiler research at the time. He was working on something called horizontal microcode: very low-level control code in which each wide microinstruction steers many parts of the hardware at once. Engineers were already writing these wide instructions, but the way they did it was awful: by hand, piece by piece, like a jigsaw puzzle. His analogy:

as if you built you did a jigsaw puzzle and there was a little space enough for one piece and you have one piece in your hand and it doesn’t fit that space. you know, the next thing you do is you take the whole puzzle and you just jumble it up and start again because you couldn’t unwind all of that.

Fisher’s insight was that this was a compiler problem in disguise. He invented a technique called trace scheduling that could automatically take ordinary sequential code — code in C, say — and pack it into wide parallel instructions. Then he realized the existing wide-instruction hardware (signal-processing boxes attached to GE’s CAT scanners, certain Cray-style scientific computers) had been designed by hardware people who never thought about compiling for them, so the architectures were impossible to compile for. So you needed new hardware designed around the compiler. He named it VLIW.
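Trace scheduling starts by picking a likely path (a "trace") through the program's branches, then schedules that whole path as if it were one long straight-line block, adding compensation code where other paths join or leave it. The sketch below covers only the first step, trace selection, in a simplified forward-only form. The function name, the data layout, and the execution counts are assumptions made up for illustration, not Fisher's published algorithm.

```python
def pick_traces(blocks, succ):
    """Split a control-flow graph into traces, hottest first.

    blocks: dict mapping block name -> estimated execution count
    succ:   dict mapping block name -> list of (successor, probability)
    """
    visited = set()
    traces = []
    # Seed each trace with the hottest block not yet placed in a trace.
    for seed in sorted(blocks, key=blocks.get, reverse=True):
        if seed in visited:
            continue
        trace = []
        current = seed
        while current is not None and current not in visited:
            trace.append(current)
            visited.add(current)
            edges = succ.get(current, [])
            # Follow the most probable outgoing edge, if there is one.
            current = max(edges, key=lambda e: e[1])[0] if edges else None
        traces.append(trace)
    return traces


# Hypothetical if/else diamond where the "then" side runs 90% of the time.
blocks = {"entry": 100, "then": 90, "else": 10, "exit": 100}
succ = {
    "entry": [("then", 0.9), ("else", 0.1)],
    "then": [("exit", 1.0)],
    "else": [("exit", 1.0)],
}
traces = pick_traces(blocks, succ)
```

Here the hot path entry, then, exit becomes one trace that can be packed into wide instructions as a unit, while the rarely executed "else" block is left as its own short trace.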

Multiflow: the technology worked, the business didn’t

Fisher, by then on the faculty at Yale, left with two colleagues and started Multiflow in 1984. They built a machine that could issue 7, 14, or even 28 operations every cycle. For comparison, a typical PC of the era issued at most one operation per cycle. Their compiler (the Bulldog compiler, written for John Ellis’s PhD thesis, which won best computer-science thesis in the world that year) was the engine that made it all work.

Their pitch was a “mini-supercomputer”: Cray-class performance for a fraction of the price. They had real customers. They had the best price-performance numbers in their segment. And then they went out of business in early 1990.

Fisher is unusually clear-headed about why. The technology didn’t fail. The business did, for two reasons:

we never lost sales because of performance. Just didn’t happen. We were fast and cheap.

First reason: the killer micros came along. Sun and Apollo workstations and the early PCs were proving you could squeeze a complete fast computer onto one chunk of silicon. Multiflow’s machine had 28 functional units, which meant boards full of TTL chips — far too much logic to fit on one die at the time. Investors lost interest in funding companies whose performance came from rooms full of hardware.

Second reason, more interesting: people are deeply suspicious of software. Multiflow’s secret sauce wasn’t the chips, it was the compiler. Hardware you can hold; software is “ephemeral and a way off somewhere.” Fisher compares the prejudice to old attitudes toward mental illness — if you can’t see it, it isn’t real. Their main competitor, Convex, sold a vector machine that looked structurally like a Cray, which customers already trusted. Multiflow’s “thousand-bit instructions” looked like aliens.

they really did show, I think, genuine interest in it, but they sure weren’t going to pump millions of dollars into it.

Multiflow died honorably. They paid off creditors. Connecticut law made it a felony for principals not to pay out unused vacation pay, and Fisher had to read in the morning paper whether he was going to jail. (He wasn’t.) Their compiler technology, separate from the hardware business that failed, was licensed widely and lived on inside many later chip companies.

Itanium: the desktop attempt that went sideways

Fast-forward to the 1990s. Hewlett-Packard wants to replace their PA-RISC processor with a 64-bit successor. Bob Rau — who had run Multiflow’s only real VLIW competitor, Cydrome, and is described by Fisher with rare warmth as one of the finest people he ever knew — had joined HP Labs. Rau pushed VLIW for the new HP architecture, and HP partnered with Intel. The result was Itanium.

Fisher, having just lost Multiflow, joined HP and walked into a room expecting to celebrate. A VLIW was about to replace the x86. Vindication. Then they handed him the architecture spec.

I blanched. I mean, I I was really unhappy when I saw that architecture. It really violated all of my principles of what a VLIW, never mind any computer, should look like. It had so much stuff in it.

He kept his real opinion mostly to himself, gave talks promoting Itanium, and quietly redirected his lab toward embedded VLIW work. His diagnosis of why Itanium failed is mostly economic, not technical:

there was a massive application base and by then of stuff people expected to do with their computers and to have any system that you need to recompile for stopped being really practical by the 1990s.

By the time Itanium shipped, the world had Windows and Linux and a giant pile of x86 software. To switch to a new architecture, every application had to be recompiled. VLIWs especially require careful per-machine compilation, because the compiler has to know exactly how many functional units the chip has. Itanium tried to run x86 software through emulation and binary translation, but that technology was immature in 2001. The chip was late, it was slow on x86 code, and its 64-bit instruction set was a clean break from x86 rather than an extension of it. ARM eventually pulled off the install-base-replacement trick on the desktop two decades later, but only because emulation finally got good enough.
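The per-machine compilation problem can be shown with a toy check, assuming a classic VLIW in which each bundle names exactly which functional units fire together. The `fits` helper, the unit kinds, and both machine shapes are hypothetical; the sketch ignores operation latencies, which a real compiler also has to match to the chip.

```python
from collections import Counter


def fits(bundles, machine):
    """True if no bundle asks for more units of any kind than `machine` has."""
    for bundle in bundles:
        needed = Counter(kind for kind, _ in bundle)
        if any(count > machine.get(kind, 0) for kind, count in needed.items()):
            return False
    return True


# Hypothetical code scheduled for a machine with 4 memory ports and 4 ALUs.
binary = [
    [("mem", "load a"), ("mem", "load b"), ("mem", "load c"), ("mem", "load d")],
    [("alu", "add a b"), ("alu", "add c d")],
    [("alu", "mul")],
]

big_machine = {"mem": 4, "alu": 4}
small_machine = {"mem": 2, "alu": 2}

runs_on_big = fits(binary, big_machine)      # scheduled for this shape
runs_on_small = fits(binary, small_machine)  # first bundle needs 4 memory ports
```

The same bundles that issue cleanly on the larger machine cannot issue on the smaller one, so in this model the code has to be rescheduled whenever the chip changes shape.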

if DraftKings or that big Hong Kong betting consortium ever wanted to publish a line on it, I’d bet pretty strongly against VLIW ever being in general purpose for as long as we still have general purpose computers.

Where VLIW actually won

Now the punchline. Fisher estimates 12 to 15 billion VLIW processors ship every year — perhaps 45 times the count of x86 chips. They are nearly invisible because they live inside other things.
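As a rough sanity check, and taking both of Fisher's figures at face value, the annual x86 volume implied by "12 to 15 billion" and "45 times" works out as follows. This is only arithmetic on the numbers quoted above, not an independent market estimate.

```latex
\frac{12 \times 10^{9}}{45} \approx 2.7 \times 10^{8}
\qquad\text{and}\qquad
\frac{15 \times 10^{9}}{45} \approx 3.3 \times 10^{8}
```

That is, somewhere around 270 to 330 million x86 chips per year on these assumptions.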

embedded is the computers that are built into things. They’re not the user does not directly knowingly use the computer. The user uses his refrigerator uses their refrigerator or uses drives their car

Embedded computers don’t need to run the world’s existing software base. They run a fixed, known set of jobs — decode video, drive a printer head, run a Bluetooth radio, do signal processing for a cell tower. For those jobs, VLIW’s tradeoffs become advantages instead of liabilities:

  • Lower power, less silicon. No supervisor logic to figure out what to do next means more transistors doing actual math.
  • Predictable timing. A superscalar’s runtime is partly a function of cache hits, branch predictions, memory contention. A VLIW does exactly what its compiled bundle says, every time. For real-time systems — a printer head, a radio, a car — predictability is gold.
  • Tons of parallelism in a small chip. When you can do eight or sixteen operations per cycle without spending area on bookkeeping, you crush DSP-style workloads.

The list of places this won: Cadence’s Tensilica customizable cores in cars and IoT; the STMicroelectronics ST2x1 family Fisher’s own group helped build (used in cable set-top boxes, base stations, HP’s printers); Bluetooth radios; storage controllers; and most consequentially, AI accelerators. Google’s TPUs, the chips that train Gemini, are VLIWs. Most of the giant matrix-multiply silicon chasing Nvidia is structurally VLIW or close to it. The architecture lost a beauty contest on the desktop and won the actual world.

The miss that haunts him

Around 2000, Fisher thought he saw the future. HP had VLIW expertise nobody else had, plus the iPaq and Jornada PDAs as hardware platforms. He proposed putting VLIW chips inside a converged device that could play movies, do language translation, run any media application — essentially a smartphone, seven years before the iPhone. He took the idea to HP’s marketing department under the new Carly Fiorina regime.

the response I got back was no no that’s not what’s going to happen what’s going to happen is people are going to buy individual devices if they want to see movies they’ll buy a device that shows movies in their hands if they want to do language translation They’ll buy a language translation device

He didn’t get to build it.

Key Takeaways

  • VLIW = compile-time parallelism, superscalar = runtime parallelism. Same end goal (do many operations per cycle), opposite philosophy on where the smarts live.
  • The “compiler is too slow” critique is wrong. Any architecture that finds that much parallelism needs that much compile-time work — a hypothetical superscalar with the same parallelism would need an equally smart compiler to feed it. VLIW just makes the work visible.
  • A failed business doesn’t mean a failed technology. Multiflow died because killer micros put complete computers on one chip and investors stopped funding board-level supercomputers. The Multiflow machines themselves were beautifully engineered and won every performance bake-off they entered.
  • Itanium failed for ecosystem reasons, not VLIW reasons. By the time it shipped, replacing x86 meant either recompiling every application on Earth or doing fast emulation, and emulation in 2001 wasn’t fast enough. The architecture itself was, in Fisher’s own opinion, also overbuilt; he disliked it from the day he read the spec.
  • Embedded is where invisible computing lives. Phones, cars, IoT, printers, base stations, AI accelerators. Counted by units, this dwarfs the desktop+server market that gets all the press.
  • VLIW’s structural wins for embedded: low power, small area, predictable timing, lots of parallelism on cheap silicon. These also happen to be exactly what AI accelerators want, which is why TPUs ended up as VLIWs.
  • Fisher’s law of mergers: when two companies merge, the worse culture wins. He got this from watching HP/Compaq from the inside.
  • A side prediction: VLIW may have a third life in quantum computing’s pulse control, where many precisely timed operations need to fire at once with minimal hardware. That tightly choreographed, static workload fits VLIW like a glove.

Claude’s Take

This is a great interview because Fisher is honest in a way most retired technologists aren’t. He had a working theory thirty years ago, watched it lose two desktop wars, and is now sitting on the inarguable result that it won the silent majority of computing instead. He doesn’t claim he saw embedded coming. He admits Itanium felt like vindication right up until he read the spec. He throws his old colleagues under the bus, then catches himself and says “I made plenty of mistakes too.” He plugs his wife’s book about Multiflow about six times in 88 minutes.

The substantive insight is the one he keeps circling back to: people confuse business outcomes with technical outcomes. Multiflow’s machines were excellent and the company died anyway. Itanium’s strategy was bad and people inferred VLIW must be bad. Meanwhile the architecture quietly took over the parts of computing where it didn’t have to fight an installed base. There’s a useful lesson in there about how technologies actually win: not by displacing entrenched standards on their home turf, but by finding the niche where their tradeoffs flip from liability to advantage.

The other thing worth chewing on is the hardware-versus-software prejudice. Fisher’s point that investors and engineers trust hardware they can hold over software they can’t see is sharp, and probably still true. A lot of why GPUs beat alternative AI architectures, and a lot of why people give Nvidia’s CUDA more credit than its silicon, comes down to which side of that divide a story lives on.

A 9. Great storyteller, real history, useful frame for thinking about why some technologies “fail” and then keep showing up everywhere.

Further Reading

  • Elizabeth Fisher, The Multiflow Computer: A Startup Odyssey — Fisher plugs his wife’s book repeatedly. It’s apparently the inside-the-startup view of Multiflow.
  • Bob Colwell, The Pentium Chronicles — Colwell was a Multiflow engineer who later led Intel’s P6/Pentium Pro design. Useful for context on why x86 went superscalar.
  • Hennessy & Patterson, Computer Architecture: A Quantitative Approach — the canonical textbook. Has the cleanest explanation of VLIW vs. superscalar tradeoffs.
  • Fisher’s original 1981 paper, “Trace Scheduling: A Technique for Global Microcode Compaction” — the algorithm that made VLIW compilation possible.
  • Asianometry’s earlier video on VLIW — the prequel that made Fisher reach out for this interview. Also Asianometry’s Itanium video, referenced throughout.