Using RL's exploitation to debug
A musing on how I think autonomous system companies should use RL.
Update (26 Oct.): I’ve added a new section to the end of this post expanding on related research and work done in industry thanks to Eugene Vinitsky, Matt Fontaine, and Prithviraj (Raj) Ammanabrolu.
Update^2 (31 Oct.): I’ve heard from a couple companies who are actually doing this! Sadly, I haven’t convinced them to let me discuss it publicly, yet.
Many autonomous system companies have very advanced simulators of their world with their existing control stack embedded1. The simulators are used for all sorts of things: evaluating current controllers on pre-defined scenarios (like test cases built from real-world data), generating synthetic data, and experimenting with new controllers.
They are missing out by not deploying a simple tool on top of this: reinforcement learning (RL) trained to find failure modes. I'm calling this adversarial verification. It is actually deeply coupled to the usual concept of adversarial attacks in ML -- here, RL plays the role of the optimizer that searches for inputs which break the system.
RL is known for its ability to exploit an environment -- manipulating the structured dynamics of the world to improve a marginal reward function. This is often unintended when you are trying to get an RL agent to solve a new task: the designer maps an imagined behavior in their mind to a specific rendition in the form of a reward function, and exploitation shows up as a mismatch between the resulting behavior and the intended specification.
This underlying characteristic of RL -- being a strong enough optimizer in complex domains to produce exploitation -- need not be tied only to misspecification. Exploitation is defined as "the action or fact of treating [something] unfairly in order to benefit from their work." It can be used to positive effect when the RL agent treats an existing controller unfairly, say adversarially, in order to surface the controller's shortcomings. Finding shortcomings like this in complex, highly engineered control systems is extremely valuable -- there are whole teams of engineers working to figure them out! In my mind, RL is the right tool to complement these engineers.
The problem formulation
It is well known that mistakes by autonomous vehicles are costly because they share space with human pedestrians. For robotics, the cost of mistakes is more economic than social because a mistake means the robot has to stop while a human intervenes (physically or via teleoperation)2. The goal of most engineers is to minimize these mistakes while continuing to improve the baseline performance.
In my short experience as an engineer, the hardest failure modes to find are the ones that come from something you would never have thought of.
Generally, when we think of an intelligent agent, we think of something like a robot and how it views the world from the first person. Actions are moving motors; states are readings of the world from sensors. Only a small change to how RL agents are viewed is needed to set this up. An RL agent is normally defined with "the usual": a Markov Decision Process (MDP), action space, state space, transition function, etc. Thankfully, we don't even need to dig into that to get the idea across; that would get boring.
As RL is applied to more abstract domains, this first-person view is only one possible way to train an agent. An agent can control any variable that can be directly changed through its inputs.
In the case of adversarial verification with RL, the actions define the domain and the states are how the robot reacts. It's interesting because the reward functions can actually be shared: if a robot has a reward for throughput and for minimizing failures, all the verification agent needs to get going is a negative sign in front of it.
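To make this concrete, here is a minimal sketch of what that negated-reward setup could look like, assuming a gym-style simulator interface. The simulator object, its `reset(scenario=...)` argument, and `run_builtin_controller` are all hypothetical names for illustration, not any company's actual API.

```python
# Minimal sketch of adversarial verification with a negated reward.
# Assumes a gym-style simulator whose reset() accepts a scenario
# configuration and whose step() returns the robot's own task reward.
# All names here are hypothetical placeholders.

class AdversarialVerificationEnv:
    """Environment for the adversary: its action picks the scenario,
    and its reward is the negated return of the robot's existing stack."""

    def __init__(self, robot_sim_env):
        self.sim = robot_sim_env

    def step(self, scenario_params):
        # The adversary "acts" by choosing a scenario, then the robot's
        # frozen control stack is rolled out inside it.
        obs = self.sim.reset(scenario=scenario_params)
        robot_return, done = 0.0, False
        while not done:
            robot_action = self.sim.run_builtin_controller(obs)
            obs, reward, done, info = self.sim.step(robot_action)
            robot_return += reward
        # Flip the sign: the adversary is rewarded for making the robot fail.
        return obs, -robot_return, True, {"robot_return": robot_return}
```

Any off-the-shelf RL algorithm can then be pointed at this wrapper; the robot's own reward function does the rest of the work.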
There are actually many lines of research related to this idea: methods for environment design that generate intelligence (rather than failures), and methods for exploration where a teacher defines challenges for the learning agent.
Configuring the domain in robotics
The specific domain where I'd try this first in robotics is something like a pick-and-place pipeline. Robotics companies working in factory automation have a slew of task names for rotating objects, moving them between belts, un/loading pallets, and more. We can choose any one of these.
Consider a robot with a pile of objects to organize onto a moving conveyor belt. This robot will pick objects up by choosing a grasp orientation and a final destination (a space on the belt), and usually, a pre-defined controller will output a trajectory in between the two. The adversarial RL agent will be tasked with defining the pile of objects to be picked up and the incoming conveyor belt speeds. Objects to be moved can have different mass distributions, shapes, and orientations that challenge the grasping algorithm.
The failure modes it finds could be any of the following: weird object shapes that the robot can't pick up at all (not that interesting), objects that are likely to be dropped because the grip is too weak, objects that seem like they'll land on the conveyor belt but then fall off, objects that make it impossible to find a grasp orientation on the input pile, and many more.
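For illustration, one way the adversary's action could be mapped to such a scenario is sketched below. Every field, bound, and unit here is an assumption I made up for the example, not a real product's configuration.

```python
# Hypothetical parameterization of the adversary's action space for the
# pick-and-place example. A flat action vector in [0, 1] is mapped to
# object properties and a belt speed; all ranges are made-up assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class PickAndPlaceScenario:
    object_masses: np.ndarray        # (N,) kilograms
    object_extents: np.ndarray       # (N, 3) bounding-box sides in meters
    object_orientations: np.ndarray  # (N, 4) initial-pose quaternions
    com_offsets: np.ndarray          # (N, 3) center-of-mass offsets (skews the grasp)
    belt_speed: float                # outgoing conveyor speed in m/s


def scenario_from_action(action: np.ndarray, num_objects: int) -> PickAndPlaceScenario:
    """Map a bounded action vector from the RL policy to a concrete scenario."""
    per_object = action[:-1].reshape(num_objects, 11)
    quats = per_object[:, 4:8]
    quats = quats / (np.linalg.norm(quats, axis=-1, keepdims=True) + 1e-8)
    return PickAndPlaceScenario(
        object_masses=0.1 + 4.9 * per_object[:, 0],       # 0.1 to 5.0 kg
        object_extents=0.02 + 0.28 * per_object[:, 1:4],  # 2 to 30 cm per side
        object_orientations=quats,
        com_offsets=0.1 * (per_object[:, 8:11] - 0.5),    # up to +/- 5 cm
        belt_speed=0.1 + 1.9 * float(action[-1]),         # 0.1 to 2.0 m/s
    )
```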
I don’t know exactly what the agent will come up with! In self-driving, I am actually a little afraid to ideate on the challenging circumstances and outcomes (does that mean it is even more needed, though? If humans don’t want to do it, then computers are desperately needed).
I'm more confident than usual that this will work. There are surely going to be some tricky spin-up problems, though. Given how many different objects could be encountered, how do you populate a distribution of new possible ones for the RL agent to sample from? This will take some creativity in methods and careful engineering to just see what works. I see this mostly as the startup cost that comes with every applied RL task, but it’s important to caveat any RL proposal with how hard a lot of this still is to do!
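As one hedged sketch of that startup problem, the adversary could act in the latent space of whatever generative shape model (or procedural object generator) a team already has. The `decode_mesh` callable below is a placeholder, not a real library call.

```python
# Sketch of bootstrapping an object distribution: the adversarial policy
# outputs a latent code, and a pretrained generative model decodes it into
# a mesh. `decode_mesh` is a hypothetical stand-in for any shape generator.
import numpy as np


class LatentObjectSampler:
    def __init__(self, decode_mesh, latent_dim: int = 32):
        self.decode_mesh = decode_mesh  # callable: latent vector -> watertight mesh
        self.latent_dim = latent_dim

    def sample(self, latent=None):
        # During warm-up, fall back to random draws (plain domain randomization);
        # later, the adversary chooses `latent` itself to target failures.
        if latent is None:
            latent = np.random.randn(self.latent_dim)
        return self.decode_mesh(latent)
```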
Why hasn't this come out of academia?
This idea fits directly into the growing divide between industry and academic research. Many of the most well-known research successes of the last few years have come from industry labs because they have a more direct ability to scale their training paradigms (and the Transformer is known to be successful largely because it scales so well with data and compute).
From my perspective, all of the simulators advanced enough for this kind of work exist at the largest applied ML companies. All of the companies I interviewed with bragged about their simulators during the interview process (Waymo, Cruise, Tesla, Dexterity, Boston Dynamics, etc.). It's obvious that a simulator is a really useful tool!
I actually think most of these companies recognized that the idea I was selling them made sense; it just didn't seem like something they knew how to hire for. If you want to give it a go, please reach out; I'm curious how it works out.
Some related work
There are some deeply related lines of work in academia, though not quite in the same flavor I indicated. They also all go by different naming conventions, so they are a bit hard to find (thanks, Twitter fam.). Some of the most relevant ones are as follows:
In industry: Modl.ai uses “self-updating bots” to enhance automated quality assurance (primarily for video games). There is a separate article to be written about the challenges of using RL agents in the gaming industry, which I will write eventually. Related is this paper from Electronic Arts on RL for procedural content generation.
In autonomous vehicles (and industrial processes): Some work from Mykel Kochenderfer’s group at Stanford on verification and autonomous vehicles. The naming conventions here are remarkably similar: Autonomous Attack Mitigation of industrial processes or Adaptive Stress Testing to identify failure modes in autonomous vehicles.
In robotics: Finally, there is some work on “Training Robots to Evaluate Robots” that creates policies to prod at specific parts of the control architecture. This seems a lot more modular than my open-ended application, but it is worth mentioning. Additionally, there is a field called Automatic Scenario Generation! Matt F. pointed out a bunch of papers on Twitter! It seems really exciting for when there are humans in the loop.
In algorithmic environment generation: There is exciting work both in environment design to promote intelligent behavior, like a curriculum (such as PAIRED, from my friend Michael), and in policies that are robust to adversarially generated environments (Adversarial Environment Generation)!
Next time!
I came up with this idea while interviewing with robotics and autonomous vehicle companies.
This will obviously change as robots enter the real world more. That is probably another decade or so behind self-driving (at least before the impact is at a similar scale!).
Thanks for the great post, Nathan! Using RL to find the failure cases in robotics is definitely an interesting perspective.
This is one of my works on generating adversarial environments for robotic applications: https://arxiv.org/abs/2107.06353. One interesting aspect is that it does not assume a parameterization of the environment (much related work in RL settings, such as maze navigation, considers parameterized environments). Instead, we find the adversarial environments from a generative model trained on an existing dataset. This allows us, for example, to find adversarial grasping objects with arbitrary shapes. I can see future work treating the generative model as an RL agent for generating adversarial environments.
I think one aspect worth considering, which you also touched upon briefly, is whether adversarial cases are truly useful for training better robotic policies. Often, better coverage of the possible cases (i.e., covering rare cases), instead of just finding the harder cases, could be more useful in settings like autonomous driving. Maybe there is some way to achieve coverage of both rare and hard cases (e.g., combining domain randomization and adversarial training).