Transcript: NVIDIA’s New AI Broke My Brain

TITLE: NVIDIA’s New AI Broke My Brain
CHANNEL: Two Minute Papers
DATE: 2026-04-25
---TRANSCRIPT---
Let’s see what is going on here. This is me around 9am. A bit wobbly, steps are unsure,
yup, that checks out. Now then, give me my fake badge. Thank you sir. Hehehe, no one
noticed. Now let’s proceed to the next step of my mastermind plans. Let’s eat all their food. Wait,
they noticed. Proceed to the next step. What was that? Oh yes, run! Now, jokes aside, look at that. Sign up for this one baby. Oh yes, please mow
my lawn. That is excellent. Rake the leaves! Perfect. Hey, don’t slack off, that’s my job! Okay, so what is going on here. Let’s start with the good news, this is a new teleoperated robot
controller and more. They call it Sonic. Now the work here is not the robot, but the software
controlling it. At least in this footage, watch until the end and you might get surprised. This
means there is a human performing these movements, and the robot is able to understand these motions,
and then translate them to a bunch of joint positions in 3D space. It’s kind of insane
that this is possible. But it will just get better and better as we continue the video. So, before you ask, yes it can do kung fu. Provided that you can do kung fu. It
understands whole body movement, so you can get it to crawl into some space you don’t want to go to.
And that is super useful, people are already using robots for that. Why? Well, chiefly,
for exploring underexplored and dangerous areas. This means tons of useful applications,
for instance, a variant of this could help save humans stuck under rubble,
or perhaps later, even explore other planets without putting humans at risk. But that’s still nothing. Because this is a multimodal system. Meaning that the input can be
almost anything. So, you say that I don’t have to pretend to mow the lawn to actually mow the lawn,
because where is the fun in that? Well, just tell it to do that. Can you? Well, currently,
for simpler tasks, like moving around or behaving like a monkey, yes you can! Absolutely incredible. And I love how expressive it is. You can ask it to walk happily,
stealthily, or like an injured person. And you know, just the fact that it is stable and does not fall is remarkable. Previously,
even in simple characters in simulated worlds, you needed thousands and thousands of tries to teach
them to just be able to walk without falling. And now, this is a huge leap forward. Wow. But it gets better, we said multimodal. Yup, that means that the input can also
be music. I’ll show you the dancing, but not the music because of YouTube reasons,
but I put a link in the description where you can check it out. And we haven’t even talked about the most insane part of the whole thing. Now hold on to your
papers Fellow Scholars, because this runs with about 42 million parameters. That is a neural
network so small that it can run on your phone so easily, the phone barely notices it. It may even run on
your toaster these days. That size is absolutely nothing. This is an incredible achievement. Okay, but how? How is that even possible? Dear Fellow Scholars, this is Two Minute Papers with
Dr. Károly Zsolnai-Fehér. Well, first, it looked at 100 million frames of human motion
to understand what we do and how we do it. The incredible thing is that this system does not
require human-made action labels, so we don’t have to explain our movements. It just watches the raw
motions and figures out how to transition between tasks without any unnatural pauses! So then, your multimodal input goes in: a video of you, your voice, music, or just text. A motion
generator turns these into human motion, and the human encoder processes it into a latent space,
and then a quantizer converts it to universal tokens. Once again, universal tokens, that is key,
you’ll see a bit later. Then, the decoder translates these tokens into motor commands. But there is a big problem. Learning to convert one to the other is super hard. First of all,
robots do not work like humans, that is one of the fundamental challenges. So if the user commands you to turn around, it should be turning around. Okay, sure. But how
fast exactly? You don’t want to try to turn 180 degrees too quickly, because you would fall apart. To solve this, in their research paper, they propose what they call a root trajectory
spring model. This dampens sudden, quick user commands so the robot does not get injured.
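A minimal sketch of this kind of damping, for readers following along in text. The transcript does not give the paper's actual formula, so this assumes a standard critically damped spring; the function name, the parameter names, and the stiffness value are all hypothetical. The point is only to show how an exponential term in time can pull a commanded heading toward its target smoothly, without overshoot.

```python
import math


def critically_damped_spring(x0, v0, target, omega, t):
    """Closed-form state of a critically damped spring at time t.

    Assumption: this is a textbook spring, not the paper's exact model.
    x0, v0  -- initial position and velocity (e.g. heading in radians)
    target  -- commanded position the spring settles at
    omega   -- hypothetical stiffness; larger means a faster response
    """
    d = x0 - target               # initial offset from the target
    c = v0 + omega * d            # constant fixed by the initial velocity
    decay = math.exp(-omega * t)  # the exponential "brake": shrinks to 0 as t grows
    x = target + (d + c * t) * decay
    v = (c - omega * (d + c * t)) * decay
    return x, v


if __name__ == "__main__":
    # A sudden "turn around" command: heading 0 -> pi radians (180 degrees).
    for step in range(0, 11):
        t = 0.2 * step
        heading, rate = critically_damped_spring(0.0, 0.0, math.pi, 4.0, t)
        print(f"t={t:.1f}s heading={heading:.3f} rad rate={rate:.3f} rad/s")
```

With a small `omega` the heading creeps toward the target very slowly, and with a large one it snaps there almost instantly, so the stiffness has to be tuned between the two.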
Yes, robots can get injured too, which is kind of hilarious. Now there is an exponential term as a function of time. What is that? That is a physical brake. As
time increases, this term rapidly shrinks to 0, which forces the whole mathematical expression to
decay smoothly. This serves two goals: one, the robot does not injure itself and two,
it will settle at a target position without oscillating back and forth forever. Nice. Now, do the damping too much, and of course, you’ll get a little slug
that can’t get anything done, so it’s really tough to do well. Well done folks. Now, all this took 128 GPUs and 3 days to train. That is expensive. But here’s the key,
after the training is done, the final product is so lightweight, we don’t need this kind of
hardware to run it at all. In fact, all of the models showcased in these videos
will be given to all of us for free, forever. They run on your phone, easy-peasy. That is
incredible. Open research for the benefit of humanity. Love it, thank you so much. This project is led by Professor Zhu and Jim Fan,
who I love dearly. Jim started the humanoid robots lab at NVIDIA just 2 years ago,
and they are raining research papers on us, breakthrough after breakthrough. Insanity. And to compress all this human movement knowledge down into a tiny little AI
controller that can be used by any of us is simply a stunning achievement. It turns out, training a good AI requires coding good thinking into a machine. But, surprisingly,
we ourselves can also learn a lot of good life advice from this kind of thinking too. For instance, the model compresses a messy, diverse soup of inputs into a kind of pure,
abstract token. You know, in life, when asking other people for advice,
you will inevitably hear everything, and its opposite too. That is also a
big soup of inputs. But try to look at all of them, side by side, and you’ll find that they
often share an underlying truth. This works, as is showcased by this incredible project too. And note that this work is not the end of anything, this is just a start. An early
work in a nascent area. Two more papers down the line, and I really hope this is
going to start folding my laundry and cooking my lunch. That would be amazing. What a time to
be alive! And this is not some proprietary nonsense, this is open knowledge and open
just dropped. If you are interested in hearing more hopefully soon, subscribe and hit the bell.