AI Research Breakthroughs From NVIDIA Research
Dear fellow scholars, [laughter] thank you so much
for coming today. I’ve got to hold on to my papers
for this one. So, I got an interesting message a
few weeks ago. Someone asked, “Károly, we have four
legendary scientists, and we want to make a round
table discussion for them. Do you know anyone
who could hang with them?” And I said, “No,
[laughter] nobody can.” Now, before they drag
me off the stage, I’ll get to learn from them
about their incredible works and how they will
enhance our lives. Don’t forget that a lot of what
you will hear today is technology that is free for
all of us forever, which is absolutely amazing.
And I have some questions for them to reveal how
all their works are connected to build a better
world for us. Now, don’t worry, I won’t be long.
Let’s see. Oh yes, this is a good one. This
is a good one. All right. Kidding. Kidding.
So, first she is a Senior Director of
AI Research and Professor at Stanford
and University of Washington. She won four
prestigious Best Paper Awards just last year.
Please give her a big hand. This
is Professor Yejin Choi. [music]
[applause]
All right, we have a situation. NVIDIA
GPUs are getting ever faster and better,
but you guys are not posting social media
fast enough to keep up with the data demand.
So, Ilya Sutskever said at his Test of Time Award talk
less than two years ago that this is a situation.
I’m excited to share what we’re doing at NVIDIA
Research in order to cope with that data shortage.
But let’s look at how LLM sausages
are made just to motivate what we do.
So it starts with pre-training on the entirety of
internet data, followed by sequential fine-tuning
on lots of curated exam-style data. Those two
together form a kind of imitation learning — you
only imitate what’s in the data. But we’ve now
all heard about RLHF — reinforcement learning with
human feedback — which allows the model to explore
for itself instead of just imitating.
That really opens a new era where
the model begins to learn to explore.
However, there are two questions. One is:
exploration only comes at the very end of this
pipeline — is that how it always should be? And
can we somehow inject this reasoning earlier into
the pipeline? So what we propose to do is RLP, or
Reinforcement Learning as Pre-training, a learning
objective where we start with this context.
Usually during pre-training you just predict
the next token, but instead what we do with RLP
is let the model think for itself even during
pre-training before predicting the next token.
When we did this, the results were super surprising
and exciting. We measured this before and after
training, controlling the amount of compute, etc.
All around, it’s looking very exciting. The future
will be the era of explorative learning, and we
are very excited to push that through. Thank you.
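One way to make "think before predicting" trainable is to score a thought by how much it raises the likelihood of the true next token. The sketch below assumes that reward shape; the talk does not spell out the exact objective, and `thought_reward` and its arguments are hypothetical names used only for illustration.

```python
import math

def thought_reward(p_next_with_thought, p_next_without_thought):
    """Log-likelihood gain on the true next token from thinking first.

    Both arguments are the probability the model assigns to the token that
    actually comes next: once after generating a thought, once without it.
    (Assumed reward shape, not confirmed by the talk.)
    """
    for p in (p_next_with_thought, p_next_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    # Positive when the thought made the true token more likely, negative otherwise.
    return math.log(p_next_with_thought) - math.log(p_next_without_thought)
```

Under this assumption, a thought that doubles the probability of the correct token earns a reward of about 0.69 nats, and a thought that makes the prediction worse is penalized, so no human labels are needed to score it.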
[applause]
All right. Next up, we have the Director of
Autonomous Vehicle Research. He’s working
on making software for self-driving cars a
reality and giving it to all of us for free.
He also won the Presidential Early Career
Award from President Obama. So, next up,
please welcome Professor Marco Pavone.
[applause]
Thank you. So, a core area of research
for my group entails the development of
reasoning models for autonomous driving — an area
where we’ve been actively contributing to
and sharing our work with the community.
As you may know, at the latest CES, NVIDIA
unveiled the Alpamayo open platform,
an open ecosystem of AI models, simulation tools,
and datasets to accelerate the development of
reasoning-based autonomous driving solutions. The
Alpamayo platform includes the Alpamayo-1 model,
a 10-billion-parameter chain-of-thought
reasoning vision-language-action
model built on the COSMOS Reason
vision-language foundation model;
Alpamayo-Sim, an end-to-end simulation framework
to test fully autonomous AV stacks; and the
Physical AI AV dataset, one of the largest and
most geographically diverse datasets available.
Everything is open — models, code, datasets
— available on Hugging Face and on GitHub.
Let me give you a glimpse into
the Alpamayo-1 reasoning model,
its core principles, and development tools.
The next frontier of autonomous driving is enabled
by the power to reason. This means developing
models that can think through extremely rare or
never-before-seen situations and act accordingly.
NVIDIA Alpamayo is helping make this happen
today with an ecosystem of open components
that bring together AI models with reasoning
capabilities to make decisions, closed-loop
simulation tools to test those decisions, and
massive real-world driving datasets to learn from.
Alpamayo-1, the first release
in this family, is a large
vision-language-action model that integrates
visual perception, language understanding,
and action generation with reasoning to explain
its own decisions. Alpamayo-1 can more accurately
perceive the environment, interpret context,
and anticipate risks through reasoning.
These explicit reasoning traces ensure logical
consistency and trustworthy explainability.
Since its release, the Alpamayo platform has seen
incredible adoption by customers. This
week at GTC, we are announcing several
extensions and upgrades. First, the release of
Alpamayo-1.5 — still a 10-billion-parameter model,
but now offering navigation and text-prompt
guidance, making it more flexibly steerable.
We are also releasing post-training scripts
based on customer demand — for example,
supervised fine-tuning scripts and post-training
alignment scripts — to allow developers to
customize the model to their own datasets.
We are expanding the scenes that you can test
your AV stack against within Alpamayo-Sim
and releasing reasoning labels and the
Chain-of-Thought auto-labeling pipeline, laying
the foundation to further research in this field.
To learn more, please refer to the
Hugging Face blog post I released yesterday.
Please note that all links will become
active by the end of this week. Thank you.
[applause]
All right, our next guest is VP of AI Research
and Head of the Toronto AI Lab. She is a master
of simulation research. She has a bunch of
“Professor of the Year” and Best Paper Awards, and
more faculty awards than I thought ever existed.
Please give it up for Professor Sanja Fidler.
[music]
Hi everyone, and thanks, Károly. Marco already
touched upon some key elements of AV development,
and here we’re going to focus on simulation, which
is a critical tool to test the AV software before
it hits the road. Traditionally this was based
on graphics — basically like graphics-based game
engines for robots. The key limitation was
that it relied on human-authored content,
and you just can’t author the
entire world — it wasn’t scalable.
Now, neural reconstruction like 3D
Gaussian splats changed the game.
Now you can go from a real recording
to a reconstructed simulation environment
to create new experiences for the policy.
Vehicles’ real-world data is transformed into
Gaussian-based 3D scenes using Omniverse Neural
Reconstruction, or NuRec, converting video
into photorealistic digital environments.
The reconstructed scenes are
simulated with variations
to assess the AI driver’s ability
to generate safe trajectories.
Physical AI isn’t just driving on roads — it’s
operating in kitchens, offices, and warehouses.
Each of these robots needs to be simulated
in virtual worlds reconstructed from multiple
sensors. With NuRec, we can rapidly render these
simulations in real time with high fidelity.
New capabilities like physics bring realistic
interaction to 3D Gaussian scenes, letting
robots engage naturally with virtual environments.
Generative AI uses the visual properties in these
simulations to add diversity, scaling a single
scene into many. Neural reconstruction is already
advancing the world’s physical AI. And now, with
just a text prompt, NuRec and COSMOS can generate
3D simulation environments that refine testing and
validation for the next generation of robotics.
This stack was integrated into
the internal production stack for
simulation. It’s running two million tests
per day, providing value for AV development.
So, the question is, what’s
next? Black screen is next.
We’ve entered the era of generative simulation.
The idea here is training a world model on massive
amounts of visual data to basically learn
how to simulate — it’s purely bound by data.
This approach can generate completely novel,
challenging scenarios — like rain, snow,
or a mattress falling onto a car.
Last year at GTC, NVIDIA announced COSMOS — thanks
to Ming-Yu somewhere in the audience. It was the
first generative simulation of its kind. At that
time, it took several minutes to generate only a
few seconds of video. Today, we’re announcing
Alpamayo Dreams — real-time, interactive,
and the first user to test it is our first
employee, really our favorite NVIDIAN: Jensen.
Alpamayo Dreams simulates
multiple cameras in real time,
reacting to the policy. It runs
in closed loop and is super easy
to edit. You can change the weather, add objects,
and test the policy in different conditions.
We have a demo at the booth —
please join me after this talk!
[applause]
How cool is that? And finally, he is leading
the Simulation and Behavior Generation team.
He simulated one thousand simultaneous
nut-and-bolt interactions in real time
on a single GPU. Please
welcome Dr. Yashraj Narang.
All right, let’s talk about robotics research.
I’m going to take you on a small journey
with a heavy focus on simulation and sim-to-real.
This is 1,024 nuts and bolts being
simulated in real time on a single GPU.
Every contact is actually being simulated.
We followed this by introducing robots into the
scene — a vibratory feeder mechanism vibrating
at 60 Hz. These hex nuts are channeled,
the robot grasps one and places it onto
the bolt. We can do hundreds of thousands of
these environments in parallel in real time.
Because we can do this so fast and accurately, we
can use it for reinforcement learning training,
leveraging algorithms that are simple,
efficient, and powerful like PPO.
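For readers who have not seen it, the clipped surrogate objective at the core of PPO fits in a few lines. This is a minimal NumPy illustration of the standard published objective, not the team's training code; the function name and the default clip range of 0.2 are illustrative assumptions.

```python
import numpy as np

def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize)."""
    new_logp = np.asarray(new_logp, dtype=float)
    old_logp = np.asarray(old_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Probability ratio between the updated policy and the policy that
    # collected the data.
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum removes any incentive to push the ratio far
    # outside [1 - clip_eps, 1 + clip_eps] in a single update.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

The clipping is what keeps each update conservative, which matters when thousands of parallel simulated environments feed one policy update.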
We brought this to the real world: assembling
gears with policies learned purely in simulation.
We didn’t just do this for gears — we
extended it to multi-part assemblies
and even tactile sensing.
We trained neural networks on this data and
learned policies in a learned simulator called
Neural Robot Dynamics, or NeRD.
It performs walking in simulation
and transfers zero-shot to the real world,
assembling NVIDIA’s GB300 superchips,
managing rigid bodies and cables.
Next, we’re applying
NeRD to industrial assembly and scientific
lab automation — aiming for autonomous robot
scientists and bringing robotics into the
home to handle real-world unpredictability.
[applause]
Amazing. We are going to follow
up with some questions and then
we are going to play a game.
You’ll never guess what it is.
All right. Yejin, is continual
learning of models the future?
Yes. So I told you about RLP, which is a form
of explorative learning during pre-training,
but we can definitely expand that spirit
even for deployment time, during test time,
by bringing training into deployment time. So
that, I envision, will be one way to really
push the frontiers of AI — by allowing the AI
model to mix what happens during training and
what happens during testing. In spirit, it’s
analogous to what I presented today, which is
to mix some of the stages of training
between the pre-training portion and the
reinforcement learning with human
feedback portion. Going forward,
I envision we will see even more hybridization of
different stages of AI training and deployment.
Marco, you mentioned that Alpamayo
is using a reasoning model.
What does that mean and how does that help?
Yeah, that’s a great question. So at a high
level, reasoning can be viewed as the process
of taking a complex problem and decomposing it
into smaller, more manageable problems, and then
planning an action step by step through this
decomposition. By now we have conclusive evidence
that forcing a model to carry out this process
leads to better, more performant actions.
It gives you an introspection signal, a
glimpse into what the model is “thinking”
before planning an action. It also provides
a signal that can be used for safety
purposes — for example, to understand whether the
model is certain about what it is doing or not. So
it’s a technology that achieves multiple purposes,
and we’re advancing it along multiple fronts.
When I read this Alpamayo paper, I felt like
previous non-reasoning self-driving systems were
a bit like sitting next to a teenager who’s
driving. [laughter] You know — smashes the
gas pedal and you’re like, “Whoa, why did
you do that?” And they say, “I don’t know,
hormones, something.” [laughter] We’re not
there yet. Yeah. And this actually reasons.
But I found out it’s not like you
just say to the AI model, “Reason,”
because during the research you had this problem:
it says things and then it does things. It says,
“If there’s going to be a red
light, I’m going to stop,”
and then there is a red light and it just
goes through. So you needed something — this
reasoning–action consistency step — to fix
this. Can you tell me a bit about that?
Yeah. Essentially, the model is producing
multiple output modalities. It produces
reasoning traces — explanations in plain English
about what the model would like to do — and it
produces actions — trajectories in physical
space. We want to make sure that reasoning
and action generation are reflective of each
other. For example, I like to make an analogy:
I like playing tennis, but I’m not
a particularly good tennis player.
Sometimes my brain has a sophisticated
strategy about where I want to send the ball,
but my body does something completely
different. That’s embodiment misalignment.
We want reasoning and action generation to
be reflective of each other so that action
generation fully harnesses the power of reasoning,
and reasoning faithfully represents the action
being generated so that it can be used as
a safety signal. To do this, we introduce
an additional training step — a post-training
alignment — where we build an explicit coupling
between the two output modalities, bringing them
together so they are very reflective of each
other. Through extensive experimentation, we’ve
seen this allows the model to produce reasoning
traces that are very faithful, while giving
you actions that are even more performant.
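The red-light example suggests what a consistency signal could look like. The following is a toy sketch only: it assumes the reasoning trace has already been reduced to a simple intent label, and the function names, the labels, and the 0.1 m/s threshold are all hypothetical. The actual post-training alignment step is not described in enough detail here to reproduce.

```python
def is_consistent(stated_intent, speeds_mps, stop_threshold=0.1):
    """Return True when a planned speed profile matches a stated intent.

    stated_intent: "stop" or "proceed" (a simplified stand-in for a
    free-text reasoning trace).
    speeds_mps: planned speeds along the trajectory, in meters per second.
    """
    if not speeds_mps:
        raise ValueError("trajectory must contain at least one speed")
    comes_to_rest = speeds_mps[-1] <= stop_threshold
    if stated_intent == "stop":
        return comes_to_rest
    if stated_intent == "proceed":
        return not comes_to_rest
    raise ValueError(f"unknown intent: {stated_intent!r}")

def consistency_reward(stated_intent, speeds_mps):
    """+1 when words and actions agree, -1 when they contradict each other."""
    return 1.0 if is_consistent(stated_intent, speeds_mps) else -1.0
```

A model that says "stop" and then plans a constant 8 m/s trajectory would receive -1 under this toy reward, which is the kind of mismatch the alignment step is meant to eliminate.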
All right, Sanja, are world models the way to go?
Yeah, great question. I’ve been at NVIDIA
for seven years, a long time, so I’ve seen
an evolution of these simulation engines.
You can definitely see the upper bound of
every type of simulation. For example, when I
joined, they were using these graphics engines.
If you wanted to test AV software in,
say, a new intersection in San Francisco,
you might wait one or two months
for artists to create it, and then
you could go and test. There was a very low
ceiling for that approach to make an impact.
Then neural radiance fields came out around 2020.
Basically the next day we decided, “Okay,
that’s the next step we need to take,”
so you could take a real-world recording and
convert it into a simulation environment.
Even with this pragmatic approach, there are
limitations. Imagine in the original recording
the car stops a few meters away from a pedestrian,
and the pedestrian just walks across the street.
If the policy later in simulation stops ten
centimeters away, the pedestrian’s behavior
will look very different — they might run, they
might flip a finger, whatever. That behavior is
very different from the original recording
and not possible with that type of approach.
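The pedestrian example can be made concrete with a toy one-dimensional simulation. Everything here is an illustrative assumption: the function names, the 1-meter reaction distance, and the speeds are invented for the sketch and do not come from any NVIDIA simulator.

```python
def replayed_pedestrian(t, gap_m):
    """Log replay: the pedestrian repeats the recording regardless of the car."""
    return "walk"

def reactive_pedestrian(t, gap_m):
    """Closed loop: the pedestrian responds to how close the car actually is."""
    return "run" if gap_m < 1.0 else "walk"

def simulate(stop_gap_m, pedestrian, start_gap_m=20.0, speed_mps=5.0, dt=0.1):
    """Drive toward a crossing, stop at `stop_gap_m`, and log pedestrian behavior."""
    gap = start_gap_m
    behaviors = []
    for step in range(100):
        # Policy under test: keep driving until the desired stopping gap is reached.
        if gap > stop_gap_m:
            gap = max(stop_gap_m, gap - speed_mps * dt)
        # The pedestrian model sees the gap produced by the policy's own actions.
        behaviors.append(pedestrian(step * dt, gap))
    return behaviors
```

With `simulate(0.1, replayed_pedestrian)` the pedestrian calmly walks even though the car stops ten centimeters away, while `simulate(0.1, reactive_pedestrian)` produces the "run" behavior that a replayed recording can never show.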
World models learn how to simulate from data. As
soon as you’ve seen such situations in the real
world — and you do — that’s basically the ceiling.
We’ve seen with LLMs that the data-driven approach
can really move mountains. So I definitely
think it’s going to be the future — maybe
not exactly as designed today, because we need to
simulate more than just cameras. There are many
sensors on robots; we also need to generate
forces so it feels like the real world. But
this learning-based approach is the
simulation engine of the future.
And just to emphasize how remarkable world models
are: you used to write handcrafted simulations,
where you programmed every single grandma on the
street, every traffic light — every “whatever”
is one paper. And now we have one model that
does everything, which is absolutely amazing.
Now, Yash, how does NeRD — Neural
Robot Dynamics — teach robots
complex physics, and how is it possible
they learn in their own imagination?
Yeah, this is a great question. First of all,
Károly, thank you for featuring NeRD.
I think it was your thousandth episode.
I’ve even seen your first episode, so I’ve been
there for the whole journey — it was an honor.
How does NeRD actually learn? As
I briefly mentioned in that video,
we take robot models in ground-truth
simulators and apply random actions to them,
generating a very diverse dataset. We then train
networks to learn from these datasets, with
several strategies to do this effectively. One
key strategy is representing the physics of the
world in a robot-centric frame. The physics when I
stand here is the same as when I stand over there,
and that invariance is a key trick
to make predictions generalize.
There are a number of other tricks and
tools on the learning side — using the right
network architecture, the right normalizations,
collecting the right data, and sampling it
properly during training. That’s the essence:
get a diverse dataset and train a network on it.
For policy learning, we use
it like any other simulator.
This neural network takes in the same simulation
inputs a classical simulator does — robot state,
joint torques, contact information — and predicts
the next state. From the simulator’s perspective,
it acts like any other solver, so you can treat
it that way and learn policies in imagination.
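The recipe described here (random actions in a ground-truth simulator, a robot-centric encoding, then a learned step function with the same interface) can be sketched on a toy system. This is a deliberately tiny stand-in, not NeRD itself: a damped 1D point mass plays the classical simulator, a least-squares linear model plays the neural network, and all names and constants are assumptions.

```python
import numpy as np

DT = 0.01  # simulation time step in seconds (assumed)

def ground_truth_step(pos, vel, force, mass=1.0, damping=0.5):
    """Stand-in 'classical simulator': damped 1D point mass, semi-implicit Euler."""
    acc = (force - damping * vel) / mass
    new_vel = vel + acc * DT
    new_pos = pos + new_vel * DT
    return new_pos, new_vel

def collect_random_data(n_samples, rng):
    """Apply random forces from random states sampled near the origin."""
    pos = rng.uniform(-1.0, 1.0, n_samples)
    vel = rng.uniform(-2.0, 2.0, n_samples)
    force = rng.uniform(-5.0, 5.0, n_samples)
    new_pos, new_vel = ground_truth_step(pos, vel, force)
    # Robot-centric encoding: inputs omit absolute position, and the target is
    # the change in position rather than the new absolute position.
    inputs = np.stack([vel, force], axis=1)
    targets = np.stack([new_pos - pos, new_vel], axis=1)
    return inputs, targets

def fit_learned_simulator(inputs, targets):
    """Least-squares linear model standing in for a neural network."""
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return weights

def learned_step(weights, pos, vel, force):
    """Same interface as the classical step, so a policy can use either one."""
    delta_pos, new_vel = np.array([vel, force]) @ weights
    return pos + delta_pos, new_vel

rng = np.random.default_rng(0)
train_inputs, train_targets = collect_random_data(500, rng)
model = fit_learned_simulator(train_inputs, train_targets)
```

Because the model never sees absolute position, a call such as `learned_step(model, 1000.0, 1.0, 2.0)` matches the ground-truth step even though training data only covered positions between -1 and 1, which is the invariance argument made above in miniature.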
All right, let’s play a game. You’ll
never guess what it’s going to be.
Okay, something really cool is going to happen.
I call it “Hold On to Your Papers.” [laughter]
I will show you papers, and I’m hoping
the panelists can guess what they are.
If you know it, please don’t shout it out
immediately, but if you know the answer,
just raise your hand so I can see that.
Okay, let’s start with something easy.
Raise your hand if you know what
this is. Come on, hands. Hands.
Okay. Some fellow scholars here. All right.
Who knows this one? Anyone? [laughter]
It’s cute. It’s cute. Okay.
So, which paper is this?
Transformers.
Transformer. Excellent. Very good.
After the warm-up round. Who knows what this is?
I want to see some hands.
Very, very good. Excellent.
Some hands. Okay. Any tips for this?
There’s a hand over there, man. [laughter]
Okay. Who? Who was it?
Yeah.
NeRF.
There we go. NeRFs. [laughter]
There we go. Excellent.
Come see me after the talk — you’re
going to get the badge of honor.
Hands. Hands. Okay. Nice.
Who knows what this one is?
Oh, that’s obvious. [laughter]
Okay. Yes?
Can I say Alpamayo-1?
Yeah.
There we go. There we go. [laughter]
Okay. I’ve got a couple more. I wasn’t sure
if anyone would guess this, but let’s try.
Anyone knows what this is? A couple of years back.
I’ll give you a hint. This is
Flamingo. Does it ring a bell?
Yep, gentleman.
Yeah, this is Flamingo. [laughter]
There we go. Good job. Good job.
Okay, I’ve got one for a very specific
person. This is for a very specific person.
Say it.
I know.
Say it.
3D-Gaussians…
3D-Gaussians.
3D-Gaussians.
There we go. [laughter]
Okay. We made a joke last night. Can you explain
it, please? Thank you, Károly. Can
you explain briefly what this is?
Yeah. So, it’s the name of a method. Last year
we had a new method that does reconstruction
and also models secondary lighting effects
— it does ray tracing essentially — and we
named it 3D-Gaussians. So everyone
was making fun of me on stage,
and I guess it’s going to happen
again this year. Thank you.
So they called it “3D-G-U,” and I was like,
“Come on, man, that’s 3D-Gaussians — it’s got to
be.” [laughter]
Have you featured all of
those papers on your channel?
I have a last one. This is
for the real fellow scholars,
for the audience. Does anyone
know what paper this is? Anyone?
Oh my. Oh my.
Okay, I’m going to tell you this one. This
is called Wavelet Turbulence, and this
is one of the best papers ever
written. It won an Oscar award.
All right, Yejin. How do synthetic
datasets help training today’s AI models,
and can they measurably help iron out their
blind spots — things they don’t do so well on?
Oh yeah. Synthetic data is used so much these
days, especially to overcome data shortage.
Internet data not only isn’t growing fast enough,
but it doesn’t cover all the corner, long-tail
situations that our AI should handle.
To overcome that, sometimes we pay human
workers to curate data, but these days it’s
powerful to rely on AI to synthesize data.
An overarching theme from this panel has been
reasoning. Reasoning needs to connect the dots
across different pieces of knowledge,
often doing so for the first time.
Whenever the model generates reasoning traces
that are genuinely new — brand-new reasoning
traces that weren’t there before —
we’re essentially synthesizing new
reasoning tokens that can then be used
to strengthen AI models even further.
Marco, what’s next in this space?
Well, particularly in reasoning models, I
think we’re still scratching the surface
of what’s possible. For example, in the video
I showed earlier, I showed reasoning traces
as textual explanations in plain
English. But as humans, we reason
with multiple modalities. We reason in visual
concepts and more latent representations. A very
active area of research is going beyond textual
representations of reasoning to other modalities.
I don’t see these modalities as being in
contradiction, but rather ask how we can synergize
them to get a system that is more performant
and more efficient at inference. For example,
one power of reasoning, from my point
of view, is planning in semantic spaces
in a counterfactual way — trying to understand
what might happen under different assumptions,
basically asking “what if” questions that
humans ask all the time. It’s important to
empower machines to do the same. How to do this
in a performant and efficient way is still very
open, and I could go on for an hour listing
research questions in this very active area.
Amazing. Sanja, do humanoids need a different
simulation environment than self-driving cars?
Can you have one unified
environment that is good for both?
Yeah, I’m tempted to say we need one simulation
model in the future. The reason is that we’re
going to have multiple robots in the same world.
Maybe today there are two separate ones — a
humanoid in the kitchen and a car on the road
— but in the end, we’ll have multiple robots
designed for different tasks but living in the
same world. It doesn’t seem plausible to have
a separate simulator per robot, because you need
to simulate one coherent world for everyone.
So the future is kind of clear — that’s where
we need to get to. It’s partly there today,
but there are differences. For example, if you
look at GR00T-Dreams — a simulator released a
few months ago by Jim Fan — and Alpamayo
Dreams, they share the same technology.
It is an autoregressive, causal model based
on COSMOS that generates frames based on action,
and maybe the action is represented a bit
differently, but the core tech is the same.
However, humanoids are all about interaction
— you need to go to the kitchen and touch a
bunch of stuff; that’s how the robot learns. In
AV, if you touch something, that’s game over.
So the type of interactions you need to simulate
is very different. Robotics right now heavily
leverages physics-based simulators as the core,
while AV is more about visual fidelity, diversity,
and real-world scale. But in the future, we’ll
likely converge toward one engine for all robots.
Yash, what are the big unsolved
problems in neural simulation?
The big unsolved problems. I think the key one
— I’ll give a concrete example. We’ve done this
NeRD framework and paper, and now we’re looking at
how to extend this to very high degree-of-freedom
systems. Humanoids are a great example. I think
each of our hands is around 27 degrees of freedom;
we have two hands, plus arms and legs — this
could be an 80-degree-of-freedom system.
Currently, we generate data for NeRD models
by applying random actions to the robot.
As you can imagine, in an 80-degree-of-freedom
action space, this doesn’t scale. You really
need more meaningful and intelligent exploration.
This is a real opportunity — especially for those
from the RL community — to look at intrinsic
objectives, intrinsic motivations like diversity,
curiosity, and empowerment, to learn how to
generate data in a meaningful subspace of
the gigantic state and action space. I
think that’s the number one challenge.
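A back-of-the-envelope sketch shows why random actions stop working at this scale, together with one very simple intrinsic objective. The three-bins-per-joint grid and the count-based bonus are illustrative assumptions chosen for brevity; the speaker names diversity, curiosity, and empowerment as candidate objectives without committing to a particular formula.

```python
import math
from collections import Counter

def grid_cells(bins_per_joint, degrees_of_freedom):
    """Number of cells in a grid with `bins_per_joint` bins along every joint."""
    return bins_per_joint ** degrees_of_freedom

def novelty_bonus(visit_counts, state_key):
    """Count-based intrinsic reward: large for rarely visited states."""
    return 1.0 / math.sqrt(1 + visit_counts[state_key])

def record_visit(visit_counts, state_key):
    """Update the visit counter after the robot reaches a (discretized) state."""
    visit_counts[state_key] += 1

# Two hands at roughly 27 degrees of freedom each already give 54.
hand_dofs = 2 * 27
visits = Counter()
```

Even a coarse grid with three bins per joint has `grid_cells(3, 80)` cells, which is more than 10^38, so uniform random sampling cannot cover it. A bonus like `novelty_bonus` instead steers data collection toward states the dataset has not seen yet.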
The second challenge: if you have a neural
simulator trained on simulation data — pre-trained
on simulation — how do you fine-tune it on
real-world data? In simulation, your inputs might
be joint torques and contacts. In the real world,
you might get joint torques from your robot,
hopefully, but contacts? You need to perceive the
world and somehow extract contact information,
or you need a robust training strategy that
can avoid reliance on that data and instead
train on heterogeneous mixtures from sim and
real. That’s where we need to take it next.
All right, if we can, I’d like some questions
from you fellow scholars. Can we do that, Maggie?
Okay, let’s see if the situation improves,
and I’m going to… Yeah, but I don’t know
if there’s going to be a microphone or
not. Okay, just shout, that’s fine too.
All right, I’ll shout. [laughter]
I have so many questions about it later.
[laughter] Do you think it’s going to be
more along the lines of stuff that all happens
in context, or just updating… what do you think?
Yeah, I can repeat the question. The
question is: with continual learning,
does it all happen in context — basically
making prompt engineering more informed
by what happens during inference time
— or do we update the model weights?
I think both are on the table. In fact,
I’ve done research in both flavors.
The benefit of updating weights is that you get
much more focused specialization of the model,
really zeroing in on one hard task that seemed
impossible. We have a recent paper called
TTT-Discover that performs lightweight
RL on top of an already trained model.
When you use an open-source model, it might not
handle certain tasks well enough, but when we
train the model to learn from its own
past mistakes during inference time, we
can unlock hidden capabilities that were there but
not expressed. There’s amazing future potential
here — exploring ways to do model updates during
test time. But one thorny thing is that you
can’t update model weights for everyone all the
time, so there are practical decisions to make.
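The weight-update flavor, and the practical concern about not changing the model for everyone, can be illustrated with a toy. In this sketch plain gradient descent on squared error stands in for the lightweight RL update mentioned above, and a linear model stands in for an LLM; the function name and hyperparameters are assumptions.

```python
import numpy as np

def adapt_at_test_time(shared_weights, task_inputs, task_targets, steps=200, lr=0.1):
    """Fine-tune a private copy of the weights on feedback from one task."""
    # np.array copies by default, so the shared model is left untouched.
    weights = np.array(shared_weights, dtype=float)
    x = np.asarray(task_inputs, dtype=float)
    y = np.asarray(task_targets, dtype=float)
    for _ in range(steps):
        error = x @ weights - y
        # Gradient of the mean squared error with respect to the weights.
        gradient = 2.0 * x.T @ error / len(y)
        weights -= lr * gradient
    return weights
```

Each task gets its own specialized copy while the shared weights stay fixed, which is one way to read the trade-off between focused specialization and serving many users at once.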
All right, if someone promises me a super quick
question and answer — who has a super quick…
What is the future of self-improving AI?
So I can take this.
Yeah, go ahead.
The future of self-improving AI — I think that is
the future. I am very excited about AI scientists:
AI systems that can discover new knowledge
for us. Current-day AI, especially with LLMs,
tends to interpolate known knowledge or
compose familiar pieces of knowledge,
whereas we really want extrapolation,
beyond what human knowledge has
discovered. To get there,
we need self-improving AI.
There’s a lot of potential coming from
coding LLM agents and other types of agents.
In the future, in embodied space, we’re going
to see many more exciting results as well.
All right. What an amazing panel. Sorry, we have
to go. Unfortunately, I want to stay longer,
but they’re going to throw me out. What an
amazing panel. This was stunning — the creators of
some of the most amazing technologies of our
lifetimes. What a time to be alive. [laughter]
All right, thank you so much to all of you
for coming. Let’s give it up for Károly.
[applause]
Thank you. Amazing. [cheering]