TITLE: NVIDIA’s New AI Broke My Brain
CHANNEL: Two Minute Papers
DATE: 2026-04-25
---TRANSCRIPT---
Let’s see what is going on here. This is me
around 9am. A bit wobbly, steps are unsure,
yup, that checks out. Now then, give me my
fake badge. Thank you, sir. Hehehe, no one
noticed. Now let’s proceed to the next step of my
mastermind plans. Let’s eat all their food. Wait,
they noticed. Proceed to the next
step. What was that? Oh yes, run!
Now, jokes aside, look at that. Sign up
for this one baby. Oh yes, please mow
my lawn. That is excellent. Rake the leaves!
Perfect. Hey, don’t slack off, that’s my job!
Okay, so what is going on here? Let’s start with
the good news, this is a new teleoperated robot
controller and more. They call it Sonic. Now
the work here is not the robot, but the software
controlling it. At least in this footage, watch
until the end and you might get surprised. This
means there is a human performing these movements,
and the robot is able to understand these motions,
and then translate them to a bunch of joint
positions in 3D space. It’s kind of insane
that this is possible. But it will just get
better and better as we continue the video.
So, before you ask, yes it can do kung
fu. Provided that you can do kung fu. It
understands whole body movement, so you can get it
to crawl into some space you don’t want to go to.
And that is super useful, people are already
using robots for that. Why? Well, chiefly,
for exploring underexplored and dangerous
areas. This means tons of useful applications,
for instance, a variant of this could
help save humans stuck under rubble,
or perhaps later, even explore other
planets without putting humans at risk.
But that’s still nothing. Because this is a
multimodal system. Meaning that the input can be
almost anything. So, you say that I don’t have to
pretend to mow the lawn to actually mow the lawn,
because where is the fun in that? Well, just
tell it to do that. Can you? Well, currently,
for simpler tasks, like moving around or behaving
like a monkey, yes you can! Absolutely incredible.
And I love how expressive it is.
You can ask it to walk happily,
stealthily, or like an injured person.
And you know, just the fact that it is stable
and does not fall is remarkable. Previously,
even in simple characters in simulated worlds, you
needed thousands and thousands of tries to teach
them to just be able to walk without falling.
And now, this is a huge leap forward. Wow.
But it gets better, we said multimodal.
Yup, that means that the input can also
be music. I’ll show you the dancing, but
not the music because of YouTube reasons,
but I put a link in the description
where you can check it out.
And we haven’t even talked about the most insane
part of the whole thing. Now hold on to your
papers Fellow Scholars, because this runs with
about 42 million parameters. That is a neural
network so simple, it can run so easily on your
phone it barely notices it. It may even run on
your toaster these days. That size is absolutely
nothing. This is an incredible achievement.
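As a rough back-of-the-envelope check on that size, here is a small sketch that estimates how much memory the weights of a 42-million-parameter network would take. The bytes-per-parameter values are the standard sizes of common number formats. Which format the released model actually uses is not stated in the video, so treat the list of formats as an assumption.

```python
# Rough memory estimate for the weights of a 42-million-parameter network.
# The numeric formats below are assumptions; the video does not say
# which precision the released model uses.
PARAMS = 42_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Return the weight storage size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


if __name__ == "__main__":
    for fmt, nbytes in BYTES_PER_PARAM.items():
        print(f"{fmt}: {model_size_mb(PARAMS, nbytes):.0f} MB")
```

Even at full 32-bit precision this comes to 168 MB of weights, which is small next to the memory available on a current phone.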
Okay, but how? How is that even possible? Dear
Fellow Scholars, this is Two Minute Papers with
Dr. Károly Zsolnai-Fehér. Well, first, it
looked at 100 million frames of human motion
to understand what we do and how we do it. The
incredible thing is that this system does not
require human-made action labels, so we don’t have
to explain our movements. It just watches the raw
motions and figures out how to transition
between tasks without any unnatural pauses!
So then, your multi-modal input goes in, a video
of you, your voice, music, or just text. A motion
generator turns these into human motion, and the
human encoder processes it into a latent space,
and then a quantizer converts it to universal
tokens. Once again, universal tokens, that is key,
you’ll see a bit later. Then, the decoder
translates these tokens into motor commands.
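To make that pipeline a bit more concrete, here is a minimal toy sketch of the encode, quantize, decode idea in Python. Everything in it is hypothetical: the 2-D latent, the four-entry `CODEBOOK`, and the `encode`, `quantize`, and `decode` functions are stand-ins for illustration, not the networks from the paper. The quantizer simply snaps each latent vector to its nearest codebook entry and emits that entry's index as the token.

```python
# Toy sketch of "encode -> quantize -> decode". All names, sizes, and
# numbers are hypothetical; this is not the actual Sonic architecture.
import math

# A tiny codebook: each entry is the embedding of one token (2-D here).
CODEBOOK = [
    (0.0, 0.0),  # token 0
    (1.0, 0.0),  # token 1
    (0.0, 1.0),  # token 2
    (1.0, 1.0),  # token 3
]


def encode(pose):
    """Stand-in encoder: average each half of the joint values into a 2-D latent."""
    half = len(pose) // 2
    return (sum(pose[:half]) / half, sum(pose[half:]) / half)


def quantize(latent):
    """Return the index of the nearest codebook entry (Euclidean distance)."""
    return min(range(len(CODEBOOK)), key=lambda i: math.dist(latent, CODEBOOK[i]))


def decode(token):
    """Stand-in decoder: map a token index to two hypothetical motor targets."""
    x, y = CODEBOOK[token]
    return [0.5 * x, 0.5 * y]


def pipeline(human_motion):
    """Run every pose through encode -> quantize -> decode."""
    tokens = [quantize(encode(pose)) for pose in human_motion]
    commands = [decode(t) for t in tokens]
    return tokens, commands


if __name__ == "__main__":
    motion = [[0.1, 0.0, 0.9, 1.1], [0.9, 1.0, 0.1, 0.0]]
    tokens, commands = pipeline(motion)
    print(tokens)    # [2, 1]
    print(commands)  # [[0.0, 0.5], [0.5, 0.0]]
```

In this sketch the decoder only ever sees token indices, so it does not care whether the motion originally came from video, voice, music, or text. That is one plausible reading of why a shared token vocabulary is useful here.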
But there is a big problem. Learning to convert
one to the other is super hard. First of all,
robots do not work like humans, that
is one of the fundamental challenges.
So if the user commands you to turn around, it
should be turning around. Okay, sure. But how
fast exactly? You don’t want to try to turn 180
degrees too quickly, because you would fall apart.
To solve this, in their research paper, they
propose what they call a root trajectory
spring model. This dampens sudden, quick user
commands so the robot does not get injured.
Yes, robots can get injured
too, which is kind of hilarious.
Now there is an exponential term as a function of
time. What is that? That is a physical brake. As
time increases, this term rapidly shrinks to 0,
which forces the whole mathematical expression to
decay smoothly. This serves two goals: one,
the robot does not injure itself and two,
it will settle at a target position without
oscillating back and forth forever. Nice.
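The transcript does not spell out the paper's exact formula, so the following is only a sketch of the general idea, assuming a textbook critically damped spring. The function name `critically_damped` and the parameter `omega` (how stiff the spring is, in 1/seconds) are made up for illustration. The `math.exp(-omega * t)` factor plays the role of the exponential brake described above.

```python
# Sketch of a critically damped spring easing a value toward a target.
# This is a textbook model, assumed here for illustration; it is not
# taken from the paper.
import math


def critically_damped(x0, v0, target, omega, t):
    """Position at time t of a critically damped spring pulled toward target.

    x0, v0: starting position and velocity; omega: stiffness in 1/seconds.
    """
    offset = x0 - target
    decay = math.exp(-omega * t)  # the exponential "brake": shrinks to 0 over time
    return target + (offset + (v0 + omega * offset) * t) * decay


if __name__ == "__main__":
    # A sudden "turn 180 degrees" command, smoothed over a few seconds.
    for k in range(0, 31, 5):
        t = 0.1 * k
        print(f"t={t:.1f}s heading={critically_damped(0.0, 0.0, 180.0, 4.0, t):.1f}")
```

Starting from rest, the heading rises smoothly toward 180 degrees and never overshoots it, which matches the "settle without oscillating" behavior described above.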
Now, do the damping too much, and
of course, you’ll get a little slug
that can’t get anything done, so it’s
really tough to do well. Well done folks.
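To get a feel for that trade-off, here is a small simulation sketch, assuming a generic damped spring rather than the paper's actual model. The function `settle_time` and its parameters are hypothetical. `zeta` is the damping ratio, where 1.0 is critical damping and larger values mean more damping. The function is only meaningful for `zeta >= 1`, where the motion does not overshoot.

```python
# Sketch: how long a damped spring takes to get close to its target.
# Generic second-order model, assumed for illustration only.


def settle_time(zeta, omega=4.0, target=1.0, tol=0.02, dt=0.001, t_max=30.0):
    """Time until a spring starting at rest at 0 gets within tol of target.

    zeta is the damping ratio (use zeta >= 1 so there is no overshoot).
    Integrates with semi-implicit Euler steps of size dt.
    """
    x, v, t = 0.0, 0.0, 0.0
    while t < t_max:
        if abs(x - target) <= tol:
            return t
        accel = omega ** 2 * (target - x) - 2.0 * zeta * omega * v
        v += accel * dt  # update velocity first...
        x += v * dt      # ...then position, using the new velocity
        t += dt
    return t_max


if __name__ == "__main__":
    for zeta in (1.0, 2.0, 5.0):
        print(f"zeta={zeta}: settles in about {settle_time(zeta):.1f} s")
```

With these made-up settings, the critically damped case settles in roughly a second and a half, while the heavily damped case takes several times longer. That is the slug from above.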
Now, all this took 128 GPUs and 3 days to
train. That is expensive. But here’s the key,
after the training is done, the final product
is so lightweight, we don’t need this kind of
hardware to run it at all. In fact, all
of the models showcased in these videos
will be given to all of us for free, forever.
They run on your phone, easy-peasy. That is
incredible. Open research for the benefit
of humanity. Love it, thank you so much.
This project is led by Professor Zhu and Jim Fan,
who I love dearly. Jim started the humanoid
robots lab at NVIDIA just 2 years ago,
and they are raining research papers on us,
breakthrough after breakthrough. Insanity.
And to compress all this human movement
knowledge down into a tiny little AI
controller that can be used by any of
us is simply a stunning achievement.
It turns out, training a good AI requires coding
good thinking into a machine. But, surprisingly,
we ourselves can also learn a lot of good
life advice from this kind of thinking too.
For instance, the model compresses a messy,
diverse soup of inputs into a kind of pure,
abstract token. You know, in life,
when asking other people for advice,
you will inevitably hear everything,
and its opposite too. That is also a
big soup of inputs. But try to look at all of
them, side by side, and you’ll find that they
often share an underlying truth. This works,
as is showcased by this incredible project too.
And note that this work is not the end of
anything, this is just a start. An early
work in a nascent area. Two more papers
down the line, and I really hope this is
going to start folding my laundry and cooking
my lunch. That would be amazing. What a time to
be alive! And this is not some proprietary
nonsense, this is open knowledge and open
just dropped. If you are interested in hearing
more hopefully soon, subscribe and hit the bell.