YouTube

Stanford MS&E435 Economics of the AI Supercycle | Spring 2026 | Enterprise Internal Knowledge

Stanford Online | published 2026-05-22 | added 2026-06-03 | score 7/10
ai llm enterprise reinforcement-learning post-training machine-learning stanford startups

ELI5/TLDR

A big general AI model is like a brilliant new hire who knows everything on the internet but nothing about your company. This lecture’s guest, Yash Patel — a 2025 Stanford grad who went to OpenAI and then started a company called Applied Compute — explains how you take that genius and teach it your specific business. The trick is not clever instructions; it’s defining exactly what a “good” answer looks like for you and then letting the model practice the task thousands of times against that definition. Surprisingly, this custom training now costs only a fraction of building the original model, which is why companies bother doing it instead of just waiting for the next, smarter model.

The Full Story

This is the model-layer session of the Stanford course. The guest is Yash Patel, who joined OpenAI’s post-training team in 2023, helped build the agentic-coding research that became Codex, then left to found Applied Compute. The whole conversation circles one idea: the world’s most valuable data lives inside companies, and the frontier models don’t know any of it.

A genius who knows nothing about your business

Patel’s framing for why his company exists:

these models were getting really really smart, but when you actually went to go and apply them inside of the enterprise, they’re like smart geniuses that know nothing about your business.

The data point underneath this is that proprietary corporate data dwarfs anything public. The public internet trained the base models, but the real depth — JP Morgan’s standards, DoorDash’s menu rules — sits locked inside individual companies. So the business is taking the same frontier training techniques the big labs use and pointing them at one company’s private definition of “good work.”

Two stages: pre-training and post-training

To follow the rest, you need two words. Think of building a model as two phases.

Pre-training is the giant, expensive phase. You take the whole internet — trillions of word-pieces, called tokens — and make the model play one endless game: guess the next token. Guess, check against the real text, nudge the model’s internal dials, repeat. Patel calls what comes out of this “compression”: all of human knowledge squeezed into a set of numbers (weights) that has somehow absorbed the patterns of language. Intelligence falls out as a side effect.
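The guess, check, nudge loop can be sketched at toy scale. This is a minimal illustration, assuming a bigram model (one row of logits per previous token) trained by plain gradient descent on cross-entropy; real pre-training uses a Transformer over trillions of tokens. The names `train_bigram` and `predict_next` are made up for this example.

```python
import math

def train_bigram(tokens, vocab, epochs=200, lr=0.5):
    """Learn next-token logits by repeated guess / check / nudge."""
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)
    # logits[prev][nxt] are the model's "dials"; all start at zero.
    logits = [[0.0] * n for _ in range(n)]
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(epochs):
        for prev, target in pairs:
            row = logits[prev]
            # Guess: softmax turns the row into a distribution over next tokens.
            peak = max(row)
            exps = [math.exp(v - peak) for v in row]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Check and nudge: the cross-entropy gradient is probs - one_hot(target).
            for j in range(n):
                row[j] -= lr * (probs[j] - (1.0 if j == target else 0.0))
    return logits

def predict_next(logits, vocab, token):
    """Return the token the model considers most likely to follow `token`."""
    row = logits[vocab.index(token)]
    return vocab[max(range(len(row)), key=row.__getitem__)]

corpus = "the cat sat on the mat and the cat ran".split()
vocab = sorted(set(corpus))
logits = train_bigram(corpus, vocab)
print(predict_next(logits, vocab, "the"))
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the learned weights favor "cat": the table has compressed the corpus statistics.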

Post-training is the much cheaper finishing phase. A raw pre-trained model just continues text. Ask it “who should I invite to dinner?” and it might spit out random names, because it’s predicting plausible next words, not answering you. Post-training teaches it manners: the back-and-forth chat format, safety limits, and the difference between a good answer and a bad one.

How cheap is “cheaper”? Patel looked up the numbers for the DeepSeek models on his way to class. Pre-training the base model took about 2.4 million GPU-hours; the reinforcement-learning step that made it a reasoning model took about 150,000, which he rounds to roughly 5% (the two rounded figures divide out to about 6%).

comes out to about 5% of the training compute that’s needed for pre-training.

That 5% is the whole commercial opening. You don’t need a frontier lab’s budget to specialize a model; you need a fraction of it. (He notes the share is climbing as labs pour more compute into this stage, but it’s still an order of magnitude cheaper than starting from scratch.)
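The arithmetic behind the claim is short enough to check directly. The sketch below uses only the two GPU-hour figures quoted in the lecture; the variable names are illustrative.

```python
# GPU-hour figures as quoted in the lecture (both are rounded).
pretrain_gpu_hours = 2_400_000
rl_gpu_hours = 150_000

share = rl_gpu_hours / pretrain_gpu_hours          # RL as a fraction of pre-training
cost_ratio = pretrain_gpu_hours / rl_gpu_hours     # how many times cheaper RL is

print(f"RL share of pre-training compute: {share:.2%}")   # 6.25%
print(f"Pre-training costs {cost_ratio:.0f}x the RL step") # 16x
```

The quoted figures give 6.25%, slightly above the "about 5%" Patel cites, presumably because the inputs are rounded. Either way the gap is about an order of magnitude.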

Define “good,” then let it practice

The core mechanism has an ugly name — reinforcement learning with verifiable rewards, RLVR — but a simple shape. Instead of writing instructions, you give the model a task, let it attempt the task hundreds or thousands of times, and after each attempt you automatically check whether it got it right. Right answers get reinforced, wrong ones discouraged. The model climbs toward “good” on its own.
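The attempt, check, reinforce loop can be sketched with a toy policy. This is a minimal illustration, assuming a fixed list of candidate answers and a REINFORCE-style update with a running baseline; a real RLVR run updates a language model's weights instead of a small table. The function `rlvr_train` and the sample task are hypothetical.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rlvr_train(candidates, verifier, attempts=2000, lr=0.1, seed=0):
    """Sample an answer, verify it automatically, reinforce what passed."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(attempts):
        probs = softmax(logits)
        choice = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = 1.0 if verifier(candidates[choice]) else 0.0
        advantage = reward - baseline
        # Gradient of log prob(choice) w.r.t. the logits is one_hot(choice) - probs.
        for j in range(len(logits)):
            indicator = 1.0 if j == choice else 0.0
            logits[j] += lr * advantage * (indicator - probs[j])
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)

# Task: "What is 17 * 3?" with an automatic, no-human check.
candidates = ["41", "51", "61", "71"]
probs = rlvr_train(candidates, verifier=lambda answer: answer == str(17 * 3))
print(dict(zip(candidates, (round(p, 3) for p in probs))))
```

No instruction ever tells the policy that the answer is 51; probability mass moves there because only that answer passes the check.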

The catch is the word verifiable. You need an automatic, no-human way to score each attempt. This is why every lab fixated on code first:

Code and math are really, really good for this because what can you do? You can compile the code, you can run unit tests against it.

Code grades itself — it either passes the tests or it doesn’t. Most real-world work doesn’t come with a built-in answer key, which is the hard part of the whole field.
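A verifiable reward for code can be as simple as "did every test pass". The sketch below is an assumption-laden illustration: `unit_test_reward` is a hypothetical helper, and calling `exec` on model-written code is only acceptable inside a sandbox, which is omitted here.

```python
def unit_test_reward(candidate_source, function_name, test_cases):
    """Return 1.0 if the candidate defines the function and passes every test, else 0.0."""
    namespace = {}
    try:
        # Stand-in for the "compile" step. Unsafe for untrusted code outside a sandbox.
        exec(candidate_source, namespace)
        func = namespace[function_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions and runtime crashes all score zero.
        return 0.0
    return 1.0

tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

print(unit_test_reward(good, "add", tests))    # 1.0
print(unit_test_reward(bad, "add", tests))     # 0.0
print(unit_test_reward(broken, "add", tests))  # 0.0
```

The reward needs no human in the loop, which is what makes it usable across thousands of attempts.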

The DoorDash example makes it concrete. DoorDash onboards 100,000-plus merchants a year, each dumping in messy menu photos that have to become a clean digital storefront following DoorDash’s own fussy style rules about add-ons and modifiers. General models couldn’t do it, and prompting didn’t fix it. The fix: take the model’s menu outputs, have humans correct them, measure the gap, and train the model directly against shrinking that error rate. No clever prompting — just a clear scorecard and repetition.
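One way to turn human corrections into a scorecard is a field-level error rate. This is a sketch under stated assumptions, not DoorDash's or Applied Compute's actual metric: the field names (`name`, `price`, `modifiers`) and the helper `field_error_rate` are invented for illustration, and model items are assumed to line up one-to-one with corrected items.

```python
def field_error_rate(model_items, corrected_items):
    """Fraction of fields in the human-corrected menu that the model got wrong."""
    total = 0
    wrong = 0
    for model_item, corrected in zip(model_items, corrected_items):
        for field, value in corrected.items():
            total += 1
            if model_item.get(field) != value:
                wrong += 1
    return wrong / total if total else 0.0

model_output = [
    {"name": "Burger", "price": 9.5, "modifiers": ["cheese"]},
    {"name": "Fries", "price": 3.0, "modifiers": []},
]
human_corrected = [
    {"name": "Burger", "price": 9.5, "modifiers": ["cheese", "bacon"]},
    {"name": "Fries", "price": 3.5, "modifiers": []},
]

error = field_error_rate(model_output, human_corrected)
reward = 1.0 - error  # the training signal: a lower error rate means a higher reward
print(f"error rate: {error:.3f}, reward: {reward:.3f}")
```

Two of six fields differ, so the error rate is one third. Training would then aim to raise the reward across many menus.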

Evals are the secret weapon

If RLVR is the engine, an eval is the steering wheel. An eval is just a test that defines what good looks like for a given task. Patel argues it’s the most guarded asset at the labs, and the logic is tidy:

whatever hill you want to climb, you first define it with an eval, then RL is kind of this like eval maxing machine.

In other words: write the test, then let the training machine relentlessly optimize toward passing it. This creates a layered world. The big labs optimize toward their evals; individual enterprises have their own private notion of good (JP Morgan and Goldman would grade the same task differently); and Applied Compute sits in between as the specialization layer, tuning models to each company’s test.
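The point that two firms would grade the same task differently can be made concrete with a toy eval harness. Everything here is hypothetical: the two graders are invented rubrics standing in for private company standards, and the "models" are stub functions.

```python
def run_eval(model, cases, grader):
    """Score a model as the fraction of eval cases whose output the grader accepts."""
    passed = sum(1 for prompt in cases if grader(prompt, model(prompt)))
    return passed / len(cases)

def grader_firm_a(prompt, output):
    # Hypothetical house style: summaries must be brief.
    return len(output.split()) <= 12

def grader_firm_b(prompt, output):
    # Hypothetical house style: every summary must include a risks section.
    return "Risks:" in output

def terse_model(prompt):
    return f"Summary of {prompt}."

def thorough_model(prompt):
    return (f"Summary of {prompt}, covering revenue, margins, guidance "
            "and outlook in detail. Risks: rates.")

cases = ["Q1 report", "Q2 report"]
for model in (terse_model, thorough_model):
    print(model.__name__,
          run_eval(model, cases, grader_firm_a),
          run_eval(model, cases, grader_firm_b))
```

The terse model scores 1.0 for firm A and 0.0 for firm B, and the thorough model scores the reverse. Whichever eval is chosen defines the hill that training then climbs.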

Why bother, when GPT-17 is coming?

The obvious objection, raised in class: why customize today’s model when a far smarter one ships next year? Patel’s answer is partly about ROI — at 5% of pre-training cost, the payback is immediate — and partly a worldview. He doesn’t believe in one all-knowing model that controls everything, because the data is too dispersed:

the world is just very fragmented place and if you just look at where the data is it’s kind of dispersed.

General models, in his telling, “set the floor”; specialized ones “set the ceiling.” A second example, from Windsurf, shows the other reason: a tiny model trained hard on one job (catching bugs the instant you save a file, in under two seconds) beats a giant general model on the three-way trade-off of performance, cost, and speed. Big models orchestrate; small specialized ones do narrow jobs fast. You run them together as an ensemble.
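The three-way trade-off can be sketched as a routing rule: among models that meet the latency budget and the accuracy bar, pick the cheapest. All numbers below are invented placeholders, not measurements from the lecture, and `ModelProfile` and `pick_model` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelProfile:
    name: str
    accuracy: float       # score on the task-specific eval, in [0, 1]
    cost_per_call: float  # dollars per call (placeholder value)
    latency_s: float      # seconds per call (placeholder value)

def pick_model(profiles, max_latency_s, min_accuracy):
    """Return the cheapest model meeting both constraints, or None if none does."""
    eligible = [
        p for p in profiles
        if p.latency_s <= max_latency_s and p.accuracy >= min_accuracy
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda p: p.cost_per_call)

profiles = [
    ModelProfile("large-general", accuracy=0.90, cost_per_call=0.05, latency_s=8.0),
    ModelProfile("small-specialized", accuracy=0.92, cost_per_call=0.002, latency_s=1.2),
]

# A save-time bug check has to answer within two seconds.
choice = pick_model(profiles, max_latency_s=2.0, min_accuracy=0.85)
print(choice.name if choice else "no model fits")
```

Under these assumed numbers the large model is excluded by latency alone, so the small model wins before cost is even compared.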

What comes next: continual learning

Today’s models are frozen after training. Touch a hot stove and you learn instantly from a single painful signal; these models can’t. Continual learning, the ability to keep learning from rare, sparse feedback in the real world, is the holy grail. Patel expects it to arrive gradually, gated mostly by data access: are you even putting the agent in front of the right people to learn from? Cursor’s Composer model is an early taste: it watches whether users accept, revert, or edit its code suggestions and updates itself in big batches over days and weeks.
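The batched-feedback idea can be sketched as follows. This is a simplified stand-in, not a description of how Cursor trains Composer: the mapping from signals to rewards, the per-style preference scores, and the class `BatchedFeedbackLearner` are all assumptions made for illustration.

```python
# Assumed mapping from user signal to reward; the real weighting is not given in the lecture.
SIGNAL_REWARD = {"accept": 1.0, "edit": 0.0, "revert": -1.0}

class BatchedFeedbackLearner:
    """Collect user feedback and apply it in batches instead of after every event."""

    def __init__(self, batch_size, lr=0.1):
        self.batch_size = batch_size
        self.lr = lr
        self.scores = {}   # suggestion style -> learned preference score
        self.pending = []  # feedback waiting for the next batch update

    def record(self, style, signal):
        self.pending.append((style, SIGNAL_REWARD[signal]))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Nudge each style's score by the mean reward it earned in this batch."""
        by_style = {}
        for style, reward in self.pending:
            by_style.setdefault(style, []).append(reward)
        for style, rewards in by_style.items():
            mean_reward = sum(rewards) / len(rewards)
            self.scores[style] = self.scores.get(style, 0.0) + self.lr * mean_reward
        self.pending = []

learner = BatchedFeedbackLearner(batch_size=4)
learner.record("short", "accept")
learner.record("short", "accept")
learner.record("long", "revert")
scores_before_batch = dict(learner.scores)  # still empty: nothing learned yet
learner.record("long", "edit")              # fourth event triggers the batch update
print(scores_before_batch, learner.scores)
```

Nothing changes until the batch fills, which mirrors the gap between this style of updating and the instant, single-signal learning of the hot-stove example.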

Loose ends

On architecture, Patel is a pragmatist: scaling the Transformer is working, so keep scaling it; if there’s a better design, a smart-enough AI will probably find it for us. He nods to Yann LeCun and Ilya Sutskever on the other side, whose first-principles objection is that humans don’t need internet-scale data to learn language, so a more efficient architecture must exist. On the future, he’s long Nvidia and compute but flags the risk that the big labs bring chip design in-house, and he’s wary of the RL-data business, where every task you solve makes the next one harder to find.

Key Takeaways

  • A general frontier model is “a smart genius that knows nothing about your business” — the most valuable data is private and locked inside enterprises.
  • Model-building has two stages: pre-training (predict the next token over the whole internet; produces raw intelligence) and post-training (align it into a helpful, safe assistant).
  • Post-training is shockingly cheap relative to pre-training — about 5% of the compute in the DeepSeek case (~150k vs ~2.4M GPU-hours) — though its share is rising.
  • RLVR (reinforcement learning with verifiable rewards): let the model attempt a task thousands of times and auto-grade each attempt; right answers get reinforced.
  • The bottleneck is the word verifiable — you need an automatic way to score attempts. Code and math qualify (compile, run unit tests), which is why every lab tackled coding first.
  • Many researchers think coding models are “AGI-complete” — most tasks reduce to writing code, so models increasingly use code as a universal interface to act on the world.
  • Evals — tests that define “good” — are the most guarded lab asset. Define the eval first; RL is an “eval-maxing machine” that climbs whatever hill the eval sets.
  • A tiered structure emerges: labs optimize toward their evals, enterprises toward their own private evals; specialization companies sit in the gap.
  • Specializing beats waiting for GPT-17 because the world’s data is fragmented; general models set the floor, specialized ones set the ceiling.
  • Small models trained hard on one task can beat large general models on the performance/cost/latency trade-off (e.g. sub-2-second bug detection for Windsurf).
  • Continual learning — learning from sparse, real-world feedback like a single stove burn — is the next frontier; mostly blocked on data access, expected to arrive gradually.
  • DoorDash onboards 100k+ merchants/year; menu extraction was solved not by prompting but by training directly against a human-corrected error rate.
  • Patel’s bet: scaling Transformers keeps working; he’s long Nvidia/compute but wary of the RL-data market, where solving each task makes the next harder.

Claude’s Take

This is a clean, honest tour of how enterprise AI customization actually works in 2026, and it earns its keep because the guest isn’t selling magic — he keeps reducing things to one unglamorous idea: define what “good” means precisely enough that a machine can grade it, then let the machine practice. That’s a genuinely useful mental model, and the 5% compute figure is the kind of concrete anchor that makes the economics click.

Two caveats. First, this is a founder describing his own company’s thesis to a room of students, so the framing (general models “set the floor,” specialization “sets the ceiling”) is exactly what you’d expect Applied Compute to believe; it may be true, but it’s not neutral. Second, the load-bearing claim that “every task is a coding task” is a researcher’s article of faith, repeated by people who came up through code, and the lecture doesn’t seriously stress-test it. The continual-learning section is appropriately humble (he flags it as gradual and data-bottlenecked rather than imminent), which I trust more than the confident parts.

A 7 because the explanations are sharp and the examples (DoorDash, Windsurf, Cursor, DeepSeek numbers) are concrete and load-bearing, but it’s one conversational lecture covering familiar ground for anyone who’s followed the post-training story, with no adversarial pushback on the founder’s worldview. Solid, not essential.

Further Reading

  • Andrej Karpathy’s write-up on RLVR (referenced in the course readings) — on what changed in 2025 when reinforcement learning with verifiable rewards came to prominence.
  • The Chinchilla scaling-laws paper (Hoffmann et al., DeepMind) — the compute-optimal recipe of scaling both model size and training data together.
  • The original OpenAI / Kaplan scaling-laws paper — the earlier result that bigger models simply get better, proven out with GPT-3.
  • “Attention Is All You Need” (Vaswani et al., Google Brain) — the Transformer paper that made language-model training scale on GPUs.