Can robotics take off like GenAI? Moravec's paradox vs. scaling laws
Arguments in the literature for and against rapid progress in robotic learning research.
This week on The Retort, Tom and I discuss AI pioneers getting their feathers ruffled about alchemy, AI Safety, and RLHF; listen here.
Robotics research has made great progress in the last few years, driven heavily by machine learning. All signs point toward that continuing, even as ChatGPT shifted so many researchers' focus toward LLMs.
This post was inspired by three things: 1) how consistently impressed I am by the Google Brain robotics work, 2) another high-profile success of deep RL (drone racing), and 3) my complete bafflement at the level of investment in humanoid robots like Tesla Optimus.
Several high-profile researchers I’ve connected with recently, people who have witnessed multiple rises and falls of robotics, feel this wave is different. Being sheltered from hype in the shadow of the LLM tsunami gives the field an added edge in continuity.
This article combines three things that are worth knowing about:
The work in scaling large ML models for robotics, which is highlighted by Google Brain’s awesome work with language-guided robots. I’ve discussed this earlier this year too — this article discusses a bigger picture.
Recent, consistent breakthroughs in RL and robotics for control (especially low-level control).
Moravec’s paradox (the observation that reasoning is easier to engineer than sensorimotor control) and why it may be the wall that prevents the first two threads from combining into a robotics takeoff.
Looking through everything, it seems pretty simple: robotics progress is going great, but we are very far from figuring out the right form factors to sell these as products. Many people and media outlets focus on consumer robots, which is the least likely path for many reasons.
Ultimately, if this is right and scaling laws from data enable a near-term takeoff in robotics (a field steeped in arguments that progress is fundamentally harder and more constrained than in digital domains), we should expect AI to tackle far harder challenges than we can even imagine in 10 or 20 years.
Scaling large models for robotics
I was a skeptic of this “robotics breakthrough incoming” narrative for a while, thinking there wasn’t enough momentum to make changes in robotics that regular people would actually feel, but I’m coming around. If we zoom out and look at the last few years, it is clear that the research community is starting to understand what makes control problems tick. My favorite example of consistently delivering high-impact work is the Google Brain robotics organization under Vincent Vanhoucke.
Here’s a timeline of the advancements from my friend Ted Xiao, who has seemingly been there every step of the way. Here you see the following papers referenced:
📜 [Nov. 2018] QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
This expanded the set of tasks, likely helping the data flywheel mature. There were also big advances in offline RL and compute capacity in the intervening years.
📜 [Apr. 2021] MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale
Then, they moved away from offline RL to behavior cloning, at the same time as papers like Gato (scaled behavior cloning) and Decision Transformer (early experiments in offline RL + transformers + scaling) started to come out. Offline RL on this modality was not yet proven.
📜 [Feb. 2022]
BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning,
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Then, with a much better language model, they scaled this up roughly 10x in the real world.
📜 [Dec. 2022] RT-1: Robotics Transformer for Real-World Control at Scale
Then, you add vision. Comparing the two system diagrams, the easiest way to see the difference is that the FiLM EfficientNet and TokenLearner components, which correspond to specific robotics data, are no longer needed. Instead, they fine-tune a large vision-language model (VLM) that takes the images and command in directly as context (after processing by a vision transformer, ViT).
📜 [Jul. 2023] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
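The core trick that lets a VLM emit robot commands is representing actions as discrete tokens, so the model can predict them the same way it predicts words. Here is a minimal sketch of that idea; the bin count and normalized action range are my own placeholder assumptions, not the values from the RT-2 paper.

```python
import numpy as np

# Assumed discretization settings (illustrative, not RT-2's actual values).
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized range for each action dimension

def action_to_tokens(action):
    """Discretize each continuous action dimension into an integer token id."""
    action = np.clip(np.asarray(action, dtype=np.float64), LOW, HIGH)
    # Map [LOW, HIGH] linearly onto [0, N_BINS - 1] and round to a bin.
    ids = np.round((action - LOW) / (HIGH - LOW) * (N_BINS - 1))
    return ids.astype(int).tolist()

def tokens_to_action(ids):
    """Invert the discretization (exact up to quantization error)."""
    ids = np.asarray(ids, dtype=np.float64)
    return (ids / (N_BINS - 1) * (HIGH - LOW) + LOW).tolist()
```

With actions encoded this way, "act" becomes just another decoding step for the fine-tuned VLM, and web-scale pretraining can transfer into control.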
It's clear that this process of scaling is getting into its groove. The data engine is paying off, the scope of the tasks is expanding, and more.
There's even been another paper in this line since the figure was made a month or so ago, expanding the Q-learning methods presented in the past to use transformers as a policy: Q-Transformer. Given this is all from Google, I see the methods converging in some ways. It's got an awesome diagram on the website explaining the high level: it combines human demonstrations with autonomous data, and balancing the two is a core principle of offline RL.
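That balance of data sources can be made concrete with a toy batch sampler. This is my own framing of the general offline-RL principle, not Q-Transformer's actual pipeline; `demo_fraction` is an assumed hyperparameter.

```python
import random

def sample_batch(demos, autonomous, batch_size=8, demo_fraction=0.5):
    """Draw a training batch mixing human demonstrations with
    autonomously collected robot experience (sampled with replacement)."""
    n_demo = int(batch_size * demo_fraction)
    batch = random.choices(demos, k=n_demo)
    batch += random.choices(autonomous, k=batch_size - n_demo)
    random.shuffle(batch)
    return batch
```

The point of the mixture: demonstrations anchor the policy to good behavior, while autonomous data gives the Q-function coverage of states the demonstrator never visited.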
The final question with the Google work is: where does this take us? Cleaning up an office space is an awesome achievement, but what line of research unlocks way more in and out of this domain? I'd like to see it keep the general approach while solving historically tricky problems from a control perspective. This is where the other half of the argument comes in: doesn't what we've seen in RL mean we can also get robots to solve arbitrary focused tasks?
There’s lots more work in this area on a weekly basis, such as this large dataset, which may be a big step forward as the cumulative amount of available data grows.
Zooming out -- progress in scaling control
Robotics is one type of control problem. If the flywheel kicks in and scaling data is all we need, then the dream is to have agents for every low-level control problem.
There are plenty of other projects worth knowing about in that vein, and the next ones are coming at a regular cadence. Last year we had fusion reactor control and quadruped locomotion; this year we have drone racing and more multi-robot controllers; before that, simulated games; and next is almost surely going to be something entirely different. The characteristics of working in control are both the strength and weakness of the field. Looking at the above problems, they all have different limiting factors on engineering deployment. A quick summary:
Google robotics: data quality, scale, pipelines…
DeepMind fusion control: sim2real accuracy, few real-world trials…
Games: complex action spaces, easy dynamics…
Drone racing: extremely high-frequency control…
This list illustrates how many different factors the experts are focusing on. Compared to something like general-use LLMs, where most metrics look similar across applications, this diversity encourages more development, but it makes it harder to follow and predict concrete trends across robotics and AI.
To a general observer of AI rather than robotics, this may seem pretty similar to recent progress in AI and LLMs. What’s different? Why is robotics hard?
Historically, this has been referred to as Moravec’s paradox: the observation that the intelligence of motion is fundamentally harder to engineer than reasoning and thinking. It is reinforced by the fact that we got ChatGPT before we got in-home robots. The important caveat when using Moravec's paradox to make predictions about the future is that it is an observation, not a theory. We have a lot of work to do to figure out why it holds.
It’s important to understand the scope of the paradox as robots gain access to language models. Language models will give the robots basic reasoning like object relations and directions, but they will not tell a robot how to manipulate motor voltages to arrange a set of actuators. Moravec’s paradox states that the latter task is hard.
If we want robots to do a lot of new tasks, which is the only way consumers will truly be blown away, they need to be able to adapt and solve problems in new environments. Fields of work, like my Ph.D. area of reinforcement learning, are posited to solve this, but we are far, far away.
The diverse breakthroughs we’ve seen above are the result of specializing in one task or domain. This will still be useful for many parts of our lives, but I don’t see literature pointing to robots suddenly becoming all-capable. It’ll be an awesome set of slow burns like autonomous vehicles, where robots get better and better at specific things.
There will be some scaling laws for robotics: when you collect more data on a specific task, you can expect certain improvements. The problem is that the general idea of emergent behavior feels very different. Control seems like a progressively harder problem, where extracting the most subtle movements requires drastically more computation and insight. In language, getting to another level of ability is rearranging the same tokens, just with more insight.
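For intuition, the per-task version of this claim borrows the familiar power-law form from language-model scaling; the symbols here are placeholders for the analogy, not measured robotics values:

```latex
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
```

where \(L\) is task error, \(D\) is the amount of task data, and \(D_c, \alpha_D\) are task-specific constants. The worry above is that each new level of dexterity behaves like a new task with its own, steeper curve, rather than falling out of one shared curve the way language abilities seem to.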
The emergence I'm most interested in with robotics is the idea that a robot trained on one set of tasks can do something entirely different in a new environment. Environment transfer, for now, seems way harder than even changing languages with your LLM. Environment transfer feels almost like changing the tokens in your vocabulary and expecting the scaling laws to hold.
I hope I am wrong, and we see progress take off, but it should be a much slower integration (which is okay).
Data, robotics, and accelerating progress
Robotics and control-based agents don't have all the luxuries of language: abundant pretraining data, digital-only domains, extreme structure, etc. I still believe we'll start to see breakthroughs through common engineering approaches: accumulate the biggest and cleanest datasets, scale up models with the right inductive bias, and repeat. The only problem is that you can't really scrape datasets for robotics; you tend to need to build them.
The way the flywheel starts to spin for robotics ML data is for robots to establish beachheads in society. We have vacuums, we have self-driving cars, we have manipulation arms. We need more, and we need the data to be accessible to a central entity. If this doesn’t happen, I don’t see another way for the spark to signal a step change in robotics work from an AI point of view.
I would love to see a link between YouTube videos, world understanding, and control, but that seems a bit too far out. Adept AI, which recently released an 8B parameter open-source LLM, is seemingly training a model on things like YouTube tutorials (the model release includes unused code for visual layers). This would be one step in that direction, but there has not yet been a strong demonstration of understanding transferring to low-level control. It transfers to basic reasoning (what should I try to do), not to execution (how do I do it).
Robotics seems to be the final frontier of the bitter lesson, the test of whether more data can defeat all other blockers. I’m very curious how it ends. If data turns out to be the trend that defeats Moravec's paradox, there are very few things I wouldn't bet on AI solving. For that reason, following progress in robotic learning will always be a great litmus test during the great phase shift we are experiencing.
Pre-paywall: Thanks to Ted Xiao and others for useful conversations that helped bootstrap this post over the last year (I’ve wanted to get back into robotics for a while).

Humanoid robot investment confusion
I’ve long been confused by the seemingly unfounded hype around the humanoid form factor from the likes of Tesla’s Optimus, 1X Robotics, and Agility Robotics.
To be clear, I don’t think Tesla will make a mass-market robot available within 15 years, and I don’t think anyone would want to buy one for many years after that.