Transcript

AI Research Breakthroughs from NVIDIA Research

Dear fellow scholars, [laughter] thank you so much for coming today. I’ve got to hold on to my papers
for this one. So, I got an interesting message a few weeks ago. Someone asked, “Károly, we have four
legendary scientists, and we want to make a round table discussion for them. Do you know anyone
who could hang with them?” And I said, “No, [laughter] nobody can.” Now, before they drag
me off the stage, I’ll get to learn from them about their incredible works and how they will
enhance our lives. Don’t forget that a lot of what you will hear today is technology that is free for
all of us forever, which is absolutely amazing. And I have some questions for them to reveal how
all their works are connected to build a better world for us. Now, don’t worry, I won’t be long.
Let’s see. Oh yes, this is a good one. This is a good one. All right. Kidding. Kidding. So, first she is a Senior Director of AI Research and Professor at Stanford
and University of Washington. She won four prestigious Best Paper Awards just last year.
Please give her a big hand. This is Professor Yejin Choi. [music] [applause]
All right, we have a situation. NVIDIA GPUs are getting ever faster and better,
but you guys are not posting on social media fast enough to keep up with the data demand.
So, Ilya Sutskever said at his Test of Time Award talk less than two years ago that this is a situation.
I’m excited to share what we’re doing at NVIDIA Research in order to cope with that data shortage.
But let’s look at how LLM sausages are made just to motivate what we do. So it starts with pre-training on the entirety of internet data, followed by sequential fine-tuning
on lots of curated exam-style data. Those two together form a kind of imitation learning — you
only imitate what’s in the data. But we’ve now all heard about RLHF (reinforcement learning from
human feedback), which allows the model to explore for itself instead of just imitating.
That really opens a new era where the model begins to learn to explore. However, there are two questions. One is:
exploration only comes at the very end of this pipeline — is that how it always should be? And
can we somehow inject this reasoning earlier into the pipeline? So what we propose to do is RLP, or
Reinforcement Learning as Pre-training, a learning objective where we start with this context.
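
As a rough illustration of the think-before-predicting idea in this part of the talk, here is a minimal Python sketch. The talk does not say how a thought is rewarded, so the reward used here (how much a thought raises the log-likelihood of the true next token compared with predicting without thinking) is an assumption. The probabilities, the candidate thoughts, and the function names are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def thought_reward(p_next_with_thought, p_next_without_thought):
    """Assumed reward: log-likelihood gain on the true next token from thinking first."""
    return math.log(p_next_with_thought) - math.log(p_next_without_thought)


def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact-expectation policy-gradient step over a small set of candidate thoughts."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [
        theta + lr * p * (r - baseline)
        for theta, p, r in zip(logits, probs, rewards)
    ]


# Hypothetical numbers: probability assigned to the true next token "5"
# after the context "2 + 3 =", without a thought and after each candidate thought.
P_NO_THOUGHT = 0.25
P_AFTER_THOUGHT = {
    "add the two numbers": 0.80,  # helpful thought
    "think about colors": 0.10,   # distracting thought
}

thoughts = list(P_AFTER_THOUGHT)
rewards = [thought_reward(P_AFTER_THOUGHT[t], P_NO_THOUGHT) for t in thoughts]

logits = [0.0, 0.0]  # start from a uniform policy over the two thoughts
for _ in range(20):
    logits = policy_gradient_step(logits, rewards)

final_probs = dict(zip(thoughts, softmax(logits)))
print(rewards)
print(final_probs)
```

In this toy setup the helpful thought gets a positive reward and the distracting one a negative reward, so the policy shifts toward thoughts that actually help predict the next token.
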
Usually during pre-training you just predict the next token, but instead what we do with RLP
is let the model think for itself even during pre-training before predicting the next token.
When we did this, the results were super surprising and exciting. We measured this before and after
training, controlling for the amount of compute, etc. All around, it’s looking very exciting. The future
will be the era of explorative learning, and we are very excited to push that through. Thank you. [applause]
All right. Next up, we have the Director of Autonomous Vehicle Research. He’s working
on making software for self-driving cars a reality and giving it to all of us for free.
He also received the Presidential Early Career Award from President Obama. So, next up,
please welcome Professor Marco Pavone. [applause]
Thank you. So, a core area of research for my group entails the development of
reasoning models for autonomous driving, an area
where we’ve been actively contributing and sharing our work with the community.
As you may know, at the latest CES, NVIDIA unveiled the Alpamayo open platform,
an open ecosystem of AI models, simulation tools, and datasets to accelerate the development of
reasoning-based autonomous driving solutions. The Alpamayo platform includes the Alpamayo-1 model,
a 10-billion-parameter chain-of-thought reasoning vision-language-action
model built on the COSMOS Reason vision-language foundation model;
Alpamayo-Sim, an end-to-end simulation framework to test fully autonomous AV stacks; and the
Physical AI AV dataset, one of the largest and most geographically diverse datasets available.
Everything is open — models, code, datasets — available on Hugging Face and on GitHub. Let me give you a glimpse into the Alpamayo-1 reasoning model,
its core principles, and development tools. The next frontier of autonomous driving is enabled by the power to reason. This means developing
models that can think through extremely rare or never-before-seen situations and act accordingly.
NVIDIA Alpamayo is helping make this happen today with an ecosystem of open components
that bring together AI models with reasoning capabilities to make decisions, closed-loop
simulation tools to test those decisions, and massive real-world driving datasets to learn from. Alpamayo-1, the first release in this family, is a large
vision-language-action model that integrates visual perception, language understanding,
and action generation with reasoning to explain its own decisions. Alpamayo-1 can more accurately
perceive the environment, interpret context, and anticipate risks through reasoning.
These explicit reasoning traces ensure logical consistency and trustworthy explainability. Since its release, the Alpamayo platform has seen
incredible adoption by customers. This week at GTC, we are announcing several
extensions and upgrades. First, the release of Alpamayo-1.5 — still a 10-billion-parameter model,
but now offering navigation and text-prompt guidance, making it more flexibly steerable. We are also releasing post-training scripts based on customer demand — for example,
supervised fine-tuning scripts and post-training alignment scripts — to allow developers to
customize the model to their own datasets. We are expanding the scenes that you can test
your AV stack against within Alpamayo-Sim and releasing reasoning labels and the
Chain-of-Thought auto-labeling pipeline, laying the foundation for further research in this field.
To learn more, please refer to the Hugging Face blog post I released yesterday.
Please note that all links will become active by the end of this week. Thank you. [applause]
All right, our next guest is VP of AI Research and Head of the Toronto AI Lab. She is a master
of simulation research. She has a bunch of “Professor of the Year” and Best Paper Awards, and
more faculty awards than I thought ever existed. Please give it up for Professor Sanja Fidler. [music] Hi everyone, and thanks, Károly. Marco already touched upon some key elements of AV development,
and here we’re going to focus on simulation, which is a critical tool to test the AV software before
it hits the road. Traditionally this was based on graphics — basically like graphics-based game
engines for robots. The key limitation was that it relied on human-authored content,
and you just can’t author the entire world — it wasn’t scalable. Now, neural reconstruction like 3D Gaussian splats changed the game.
Now you can go from a real recording
to a reconstructed simulation environment to create new experiences for the policy. Vehicles’ real-world data is transformed into Gaussian-based 3D scenes using Omniverse Neural
Reconstruction, or NuRec, converting video into photorealistic digital environments.
The reconstructed scenes are simulated with variations
to assess the AI driver’s ability to generate safe trajectories. Physical AI isn’t just driving on roads; it’s operating in kitchens, offices, and warehouses.
Each of these robots needs to be simulated in virtual worlds reconstructed from multiple
sensors. With NuRec, we can rapidly render these simulations in real time with high fidelity. New capabilities like physics bring realistic interaction to 3D Gaussian scenes, letting
robots engage naturally with virtual environments. Generative AI uses the visual properties in these
simulations to add diversity, scaling a single scene into many. Neural reconstruction is already
advancing the world’s physical AI. And now, with just a text prompt, NuRec and COSMOS can generate
3D simulation environments that refine testing and validation for the next generation of robotics. This stack was integrated into the internal production stack for
simulation. It’s running two million tests per day, providing value for AV development. So, the question is, what’s next? Black screen is next. We’ve entered the era of generative simulation. The idea here is training a world model on massive
amounts of visual data to basically learn how to simulate — it’s purely bound by data.
This approach can generate completely novel, challenging scenarios — like rain, snow,
or a mattress falling onto a car. Last year at GTC, NVIDIA announced COSMOS, thanks to Ming-Yu somewhere in the audience. It was the
first generative simulation of its kind. At that time, it took several minutes to generate only a
few seconds of video. Today, we’re announcing Alpamayo Dreams: real-time, interactive,
and the first user to test it is our first employee, really our favorite NVIDIAN, Jensen. Alpamayo Dreams simulates multiple cameras in real time,
reacting to the policy. It runs in closed loop and is super easy
to edit. You can change the weather, add objects, and test the policy in different conditions.
We have a demo at the booth — please join me after this talk! [applause] How cool is that? And finally, he is leading the Simulation and Behavior Generation team.
He simulated one thousand simultaneous nut-and-bolt interactions in real time
on a single GPU. Please welcome Dr. Yashraj Narang. All right, let’s talk about robotics research. I’m going to take you on a small journey
with a heavy focus on simulation and sim-to-real. This is 1,024 nuts and bolts being simulated in real time on a single GPU.
Every contact is actually being simulated.
We followed this by introducing robots into the scene — a vibratory feeder mechanism vibrating
at 60 Hz. These hex nuts are channeled, the robot grasps one and places it onto
the bolt. We can do hundreds of thousands of these environments in parallel in real time. Because we can do this so fast and accurately, we can use it for reinforcement learning training,
leveraging algorithms that are simple, efficient, and powerful like PPO. We brought this to the real world: assembling gears with policies learned purely in simulation.
We didn’t just do this for gears — we extended it to multi-part assemblies
and even tactile sensing. We trained neural networks on this data and learned policies in a learned simulator called
Neural Robot Dynamics, or NeRD. It performs walking in simulation
and transfers zero-shot to the real world, assembling NVIDIA’s GB300 superchips,
managing rigid bodies and cables. Next, we’re applying
NeRD to industrial assembly and scientific lab automation — aiming for autonomous robot
scientists and bringing robotics into the home to handle real-world unpredictability. [applause] Amazing. We are going to follow up with some questions and then
we are going to play a game. You’ll never guess what it is. All right. Yejin, is continual learning of models the future? Yes. So I told you about RLP, which is a form of explorative learning during pre-training,
but we can definitely expand that spirit even for deployment time, during test time,
by bringing training into deployment time. So that, I envision, will be one way to really
push the frontiers of AI — by allowing the AI model to mix what happens during training and
what happens during testing. In spirit, it’s analogous to what I presented today, which is
to mix some of the stages of training between the pre-training portion and the
reinforcement learning from human feedback portion. Going forward,
I envision we will see even more hybridization of different stages of AI training and deployment. Marco, you mentioned that Alpamayo is using a reasoning model.
What does that mean and how does that help? Yeah, that’s a great question. So at a high level, reasoning can be viewed as the process
of taking a complex problem and decomposing it into smaller, more manageable problems, and then
planning an action step by step through this decomposition. By now we have conclusive evidence
that forcing a model to carry out this process leads to better, more performant actions.
It gives you an introspection signal, a glimpse into what the model is “thinking”
before planning an action. It also provides a signal that can be used for safety
purposes — for example, to understand whether the model is certain about what it is doing or not. So
it’s a technology that achieves multiple purposes, and we’re advancing it along multiple fronts. When I read this Alpamayo paper, I felt like previous non-reasoning self-driving systems were
a bit like sitting next to a teenager who’s driving. [laughter] You know — smashes the
gas pedal and you’re like, “Whoa, why did you do that?” And they say, “I don’t know,
hormones, something.” [laughter] We’re not there yet. Yeah. And this actually reasons. But I found out it’s not like you just say to the AI model, “Reason,”
because during the research you had this problem: it says things and then it does things. It says,
“If there’s going to be a red light, I’m going to stop,”
and then there is a red light and it just goes through. So you needed something — this
reasoning–action consistency step — to fix this. Can you tell me a bit about that? Yeah. Essentially, the model is producing multiple output modalities. It produces
reasoning traces — explanations in plain English about what the model would like to do — and it
produces actions — trajectories in physical space. We want to make sure that reasoning
and action generation are reflective of each other. For example, I like to make an analogy:
I like playing tennis, but I’m not a particularly good tennis player.
Sometimes my brain has a sophisticated strategy about where I want to send the ball,
but my body does something completely different. That’s embodiment misalignment. We want reasoning and action generation to be reflective of each other so that action
generation fully harnesses the power of reasoning, and reasoning faithfully represents the action
being generated so that it can be used as a safety signal. To do this, we introduce
an additional training step — a post-training alignment — where we build an explicit coupling
between the two output modalities, bringing them together so they are very reflective of each
other. Through extensive experimentation, we’ve seen this allows the model to produce reasoning
traces that are very faithful, while giving you actions that are even more performant. All right, Sanja, are world models the way to go? Yeah, great question. I’ve been at NVIDIA for seven years, a long time, so I’ve seen
an evolution of these simulation engines. You can definitely see the upper bound of
every type of simulation. For example, when I joined, they were using these graphics engines.
If you wanted to test AV software in, say, a new intersection in San Francisco,
you might wait one or two months for artists to create it, and then
you could go and test. There was a very low ceiling for that approach to make an impact. Then neural radiance fields came out around 2020.
Basically the next day we decided, “Okay, that’s the next step we need to take,”
so you could take a real-world recording and convert it into a simulation environment.
Even with this pragmatic approach, there are limitations. Imagine in the original recording
the car stops a few meters away from a pedestrian, and the pedestrian just walks across the street.
If the policy later in simulation stops ten centimeters away, the pedestrian’s behavior
will look very different — they might run, they might flip a finger, whatever. That behavior is
very different from the original recording and not possible with that type of approach. World models learn how to simulate from data. As soon as you’ve seen such situations in the real
world — and you do — that’s basically the ceiling. We’ve seen with LLMs that the data-driven approach
can really move mountains. So I definitely think it’s going to be the future — maybe
not exactly as designed today, because we need to simulate more than just cameras. There are many
sensors on robots; we also need to generate forces so it feels like the real world. But
this learning-based approach is the simulation engine of the future. And just to emphasize how remarkable world models are: you used to write handcrafted simulations,
where you programmed every single grandma on the street, every traffic light — every “whatever”
is one paper. And now we have one model that does everything, which is absolutely amazing. Now, Yash, how does NeRD — Neural Robot Dynamics — teach robots
complex physics, and how is it possible they learn in their own imagination? Yeah, this is a great question. First of all,
Károly, thank you for featuring NeRD. I think it was your thousandth episode.
I’ve even seen your first episode, so I’ve been there for the whole journey — it was an honor. How does NeRD actually learn? As I briefly mentioned in that video,
we take robot models in ground-truth simulators and apply random actions to them,
generating a very diverse dataset. We then train networks to learn from these datasets, with
several strategies to do this effectively. One key strategy is representing the physics of the
world in a robot-centric frame. The physics when I stand here is the same as when I stand over there,
and that invariance is a key trick to make predictions generalize. There are a number of other tricks and tools on the learning side — using the right
network architecture, the right normalizations, collecting the right data, and sampling it
properly during training. That’s the essence: get a diverse dataset and train a network on it. For policy learning, we use it like any other simulator.
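
The recipe described above (apply random actions in a ground-truth simulator, fit a model that predicts the next state in a robot-centric frame, then step it like any other solver) can be sketched in a few lines of Python. This is a toy under stated assumptions, not the NeRD implementation: the "robot" is a hypothetical damped point mass, the constants are made up, and ordinary least squares stands in for the neural network.

```python
import random

DT = 0.01      # hypothetical time step (s)
MASS = 2.0     # hypothetical mass (kg)
DAMPING = 0.5  # hypothetical damping coefficient


def true_step(pos, vel, force):
    """Ground-truth simulator: semi-implicit Euler for a damped point mass."""
    acc = (force - DAMPING * vel) / MASS
    new_vel = vel + acc * DT
    new_pos = pos + new_vel * DT
    return new_pos, new_vel


def collect_data(n, rng):
    """Apply random actions from random states and store robot-centric samples."""
    rows = []
    for _ in range(n):
        pos = rng.uniform(-10.0, 10.0)
        vel = rng.uniform(-1.0, 1.0)
        force = rng.uniform(-5.0, 5.0)
        new_pos, new_vel = true_step(pos, vel, force)
        # Inputs omit absolute position; targets are changes relative to the current state.
        rows.append(((vel, force), (new_pos - pos, new_vel - vel)))
    return rows


def fit_linear(rows, out_idx):
    """Least squares for target = w0 * vel + w1 * force via the 2x2 normal equations."""
    a = b = c = r0 = r1 = 0.0
    for (v, f), target in rows:
        y = target[out_idx]
        a += v * v
        b += v * f
        c += f * f
        r0 += v * y
        r1 += f * y
    det = a * c - b * b
    return ((c * r0 - b * r1) / det, (a * r1 - b * r0) / det)


class LearnedSimulator:
    """Stand-in for the learned dynamics model; exposes the same step() interface."""

    def __init__(self, rows):
        self.w_dpos = fit_linear(rows, 0)
        self.w_dvel = fit_linear(rows, 1)

    def step(self, pos, vel, force):
        dpos = self.w_dpos[0] * vel + self.w_dpos[1] * force
        dvel = self.w_dvel[0] * vel + self.w_dvel[1] * force
        return pos + dpos, vel + dvel


rng = random.Random(0)
learned = LearnedSimulator(collect_data(500, rng))

# Roll both simulators forward from a position far outside the training range.
state_true = state_learned = (1000.0, 0.0)
for _ in range(100):
    force = rng.uniform(-5.0, 5.0)
    state_true = true_step(*state_true, force)
    state_learned = learned.step(*state_learned, force)
print(state_true, state_learned)
```

Because the model never sees absolute position, it predicts equally well at positions it was never trained on, which is the invariance argument made above in miniature.
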
This neural network takes in the same simulation inputs a classical simulator does — robot state,
joint torques, contact information — and predicts the next state. From the simulator’s perspective,
it acts like any other solver, so you can treat it that way and learn policies in imagination. All right, let’s play a game. You’ll never guess what it’s going to be.
Okay, something really cool is going to happen. I call it “Hold On to Your Papers.” [laughter]
I will show you papers, and I’m hoping the panelists can guess what they are.
If you know it, please don’t shout it out immediately, but if you know the answer,
just raise your hand so I can see that. Okay, let’s start with something easy. Raise your hand if you know what this is. Come on, hands. Hands.
Okay. Some fellow scholars here. All right. Who knows this one? Anyone? [laughter] It’s cute. It’s cute. Okay. So, which paper is this? Transformers. Transformer. Excellent. Very good. After the warm-up round. Who knows what this is? I want to see some hands. Very, very good. Excellent.
Some hands. Okay. Any tips for this? There’s a hand over there, man. [laughter] Okay. Who? Who was it? Yeah. NeRF. There we go. NeRFs. [laughter] There we go. Excellent.
Come see me after the talk — you’re going to get the badge of honor. Hands. Hands. Okay. Nice. Who knows what this one is? Oh, that’s obvious. [laughter] Okay. Yes? Can I say Alpamayo-1? Yeah. There we go. There we go. [laughter] Okay. I’ve got a couple more. I wasn’t sure if anyone would guess this, but let’s try.
Anyone knows what this is? A couple of years back. I’ll give you a hint. This is Flamingo. Does it ring a bell? Yep, gentleman. Yeah, this is Flamingo. [laughter] There we go. Good job. Good job. Okay, I’ve got one for a very specific person. This is for a very specific person. Say it. I know. Say it. 3D-Gaussians… 3D-Gaussians. 3D-Gaussians. There we go. [laughter] Okay. We made a joke last night. Can you explain
it, please? Thank you, Károly. Can you explain briefly what this is? Yeah. So, it’s the name of a method. Last year we had a new method that does reconstruction
and also models secondary lighting effects — it does ray tracing essentially — and we
named it 3D-Gaussians. So everyone was making fun of me on stage,
and I guess it’s going to happen again this year. Thank you. So they called it “3D-G-U,” and I was like, “Come on, man, that’s 3D-Gaussians — it’s got to
be.” [laughter] Have you featured all of those papers on your channel? I have a last one. This is for the real fellow scholars,
for the audience. Does anyone know what paper this is? Anyone? Oh my. Oh my. Okay, I’m going to tell you this one. This is called Wavelet Turbulence, and this
is one of the best papers ever written. It won an Oscar award. All right, Yejin. How do synthetic datasets help training today’s AI models,
and can they measurably help iron out their blind spots — things they don’t do so well on? Oh yeah. Synthetic data is used so much these days, especially to overcome data shortage.
Internet data not only isn’t growing fast enough, but it doesn’t cover all the corner, long-tail
situations that our AI should handle. To overcome that, sometimes we pay human
workers to curate data, but these days it’s powerful to rely on AI to synthesize data. An overarching theme from this panel has been reasoning. Reasoning needs to connect the dots
across different pieces of knowledge, often doing so for the first time.
Whenever the model generates reasoning traces that are genuinely new — brand-new reasoning
traces that weren’t there before — we’re essentially synthesizing new
reasoning tokens that can then be used to strengthen AI models even further. Marco, what’s next in this space? Well, particularly in reasoning models, I think we’re still scratching the surface
of what’s possible. For example, in the video I showed earlier, I showed reasoning traces
as textual explanations in plain English. But as humans, we reason
with multiple modalities. We reason in visual concepts and more latent representations. A very
active area of research is going beyond textual representations of reasoning to other modalities. I don’t see these modalities as being in contradiction, but rather ask how we can synergize
them to get a system that is more performant and more efficient at inference. For example,
one power of reasoning, from my point of view, is planning in semantic spaces
in a counterfactual way — trying to understand what might happen under different assumptions,
basically asking “what if” questions that humans ask all the time. It’s important to
empower machines to do the same. How to do this in a performant and efficient way is still very
open, and I could go on for an hour listing research questions in this very active area. Amazing. Sanja, do humanoids need a different simulation environment than self-driving cars?
Can you have one unified environment that is good for both? Yeah, I’m tempted to say we need one simulation model in the future. The reason is that we’re
going to have multiple robots in the same world. Maybe today there are two separate ones — a
humanoid in the kitchen and a car on the road — but in the end, we’ll have multiple robots
designed for different tasks but living in the same world. It doesn’t seem plausible to have
a separate simulator per robot, because you need to simulate one coherent world for everyone. So the future is kind of clear — that’s where we need to get to. It’s partly there today,
but there are differences. For example, if you look at GR00T-Dreams — a simulator released a
few months ago by Jim Fan — and Alpamayo Dreams, they share the same technology.
It is an autoregressive, causal model based on COSMOS that generates frames based on action,
and maybe the action is represented a bit differently, but the core tech is the same.
However, humanoids are all about interaction — you need to go to the kitchen and touch a
bunch of stuff; that’s how the robot learns. In AV, if you touch something, that’s game over. So the type of interactions you need to simulate is very different. Robotics right now heavily
leverages physics-based simulators as the core, while AV is more about visual fidelity, diversity,
and real-world scale. But in the future, we’ll likely converge toward one engine for all robots. Yash, what are the big unsolved problems in neural simulation? The big unsolved problems. I think the key one — I’ll give a concrete example. We’ve done this
NeRD framework and paper, and now we’re looking at how to extend this to very high degree-of-freedom
systems. Humanoids are a great example. I think each of our hands is around 27 degrees of freedom;
we have two hands, plus arms and legs — this could be an 80-degree-of-freedom system. Currently, we generate data for NeRD models by applying random actions to the robot.
As you can imagine, in an 80-degree-of-freedom action space, this doesn’t scale. You really
need more meaningful and intelligent exploration. This is a real opportunity — especially for those
from the RL community — to look at intrinsic objectives, intrinsic motivations like diversity,
curiosity, and empowerment, to learn how to generate data in a meaningful subspace of
the gigantic state and action space. I think that’s the number one challenge. The second challenge: if you have a neural simulator trained on simulation data — pre-trained
on simulation — how do you fine-tune it on real-world data? In simulation, your inputs might
be joint torques and contacts. In the real world, you might get joint torques from your robot,
hopefully, but contacts? You need to perceive the world and somehow extract contact information,
or you need a robust training strategy that can avoid reliance on that data and instead
train on heterogeneous mixtures from sim and real. That’s where we need to take it next. All right, if we can, I’d like some questions from you fellow scholars. Can we do that, Maggie? Okay, let’s see if the situation improves, and I’m going to… Yeah, but I don’t know
if there’s going to be a microphone or not. Okay, just shout, that’s fine too. All right, I’ll shout. [laughter] I have so many questions about it later. [laughter] Do you think it’s going to be
more along the lines of stuff that all happens in context, or just updating… what do you think? Yeah, I can repeat the question. The question is: with continual learning,
does it all happen in context — basically making prompt engineering more informed
by what happens during inference time — or do we update the model weights?
I think both are on the table. In fact, I’ve done research in both flavors. The benefit of updating weights is that you get much more focused specialization of the model,
really zeroing in on one hard task that seemed impossible. We have a recent paper called
TTT-Discover that performs lightweight RL on top of an already trained model.
When you use an open-source model, it might not handle certain tasks well enough, but when we
train the model to learn from its own past mistakes during inference time, we
can unlock hidden capabilities that were there but not expressed. There’s amazing future potential
here — exploring ways to do model updates during test time. But one thorny thing is that you
can’t update model weights for everyone all the time, so there are practical decisions to make. All right, if someone promises me a super quick question and answer — who has a super quick… What is the future of self-improving AI? So I can take this. Yeah, go ahead. The future of self-improving AI — I think that is the future. I am very excited about AI scientists:
AI systems that can discover new knowledge for us. Current-day AI, especially with LLMs,
tends to interpolate known knowledge or compose familiar pieces of knowledge,
whereas we really want extrapolation, beyond what human knowledge has
discovered. To get there, we need self-improving AI. There’s a lot of potential coming from coding LLM agents and other types of agents.
In the future, in embodied space, we’re going to see many more exciting results as well. All right. What an amazing panel. Sorry, we have to go. Unfortunately, I want to stay longer,
but they’re going to throw me out. What an amazing panel. This was stunning — the creators of
some of the most amazing technologies of our lifetimes. What a time to be alive. [laughter]
All right, thank you so much to all of you for coming. Let’s give it up for Károly. [applause] Thank you. Amazing. [cheering]