LLM agents follow-up: exploration, RLHF, and more
How does the autonomy of language models relate to data collection?
This article follows heavily from Wednesday's post on LLM agents. As usual, more than two-thirds of this post is free, with just the conclusion under the paywall.
At some point, people will get a grip on how large language models (LLMs) interact with users and environments to generate new data. We are very far from that point. Right now, some feedback systems built around LLMs exist, but the data flow, and our understanding of it, are extremely lacking. The information flow is primarily tailored by people controlling behavior via prompts and other interconnects between LLMs and data sources (e.g. search engines or vector databases), not by any understanding of feedback.
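To be concrete about what that prompt-and-interconnect wiring looks like, here is a minimal sketch of such a pipeline; `call_llm` and the toy keyword retriever below are placeholders for whatever model API and vector database a real product wires in:

```python
# Minimal sketch of how most "feedback" loops are wired today: a human-written
# prompt template plus a retrieval step, not any learning from feedback.
# `call_llm` is a placeholder for an actual API or local model call.

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real model call.
    return f"[model response to {len(prompt)} chars of prompt]"

# Toy in-memory "vector database": retrieval here is just keyword overlap.
DOCUMENTS = [
    "RLHF fine-tunes a policy against a reward model trained on preferences.",
    "Vector databases store embeddings for nearest-neighbor retrieval.",
    "LLM agents chain model calls with tools like search engines.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by word overlap with the query (stand-in for embedding search).
    scored = sorted(
        DOCUMENTS,
        key=lambda doc: -len(set(query.lower().split()) & set(doc.lower().split())),
    )
    return scored[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    # All of the "intelligence" in the data flow lives in this template.
    prompt = f"Use the context to answer.\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)

print(answer("How do LLM agents use search engines?"))
```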
As the goals of these models become broader, they will need to be let off the leash to see if they can solve more open-ended problems. The vision for LLM agents has to be that the agent can eventually try things that you're not sure it can do.
The drive for open-ended operation, in the search for ultimate capabilities, is implicit in the risks I see for these models. It is the countervailing force balancing out the integration dead ends. Researchers will inevitably take the gloves off.
AI-safety arguments aside, and assuming some basic things are off-limits for the model (restriction to certain sites, constraints on the speed of action via built-in lag, and more), there is a huge open space here. In my article on hallucinations, I began thinking about the potential risks of playing in this space:
Coming at LLMs from a robotics and decision-making background makes me feel duty-bound to have a distinction between hallucinations and exploration in data collection. I don't. Hallucinating in relatively narrow-scoped sampling domains like robotic control seems beneficial and very similar to exploration methods. This does not necessarily scale when actions are taken at the scale of the internet, or of mouse-and-keyboard input. If you give the model full access to the normal input-output tools humans use with computers, you really don't want it pressing random combinations of keys. This intuitively feels closer to the factual-answers issue with hallucination, but not quite. I'm wondering if this intuitive gap exists only because training practices on language data, rather than action data, are so much better defined.
On the other hand, factual errors can also seem similar to when a planning-based policy selects the wrong action. The model probably has all the information to know that Obama was inaugurated in 2009, but sometimes it'll sample wrong. In some domains, trying new things gives you insight into valuable new data. Interacting with users obsessed with information retrieval is not that type of domain. I suspect that hallucinations are not talked about as much in real-world decision-making domains because the failures get looped into other types of analysis: was the value prediction wrong, were the predicted trajectories wrong, etc.? Hallucinations in grounded domains may be easier to analyze than in language!
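To make the "sometimes it'll sample wrong" point concrete, here is a toy sketch of temperature sampling over made-up logits (the numbers are purely illustrative, not from any real model):

```python
# Toy illustration of "the model knows it but sometimes samples wrong":
# the correct answer holds most of the probability mass, yet temperature
# sampling still picks a wrong year some fraction of the time.
import math
import random

random.seed(0)

# Hypothetical next-token logits for the inauguration year.
logits = {"2009": 5.0, "2008": 2.5, "2013": 2.0}

def sample(logits: dict[str, float], temperature: float = 1.0) -> str:
    # Softmax over temperature-scaled logits, then sample one token.
    scaled = {tok: val / temperature for tok, val in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

samples = [sample(logits, temperature=1.0) for _ in range(1000)]
print("fraction wrong:", 1 - samples.count("2009") / len(samples))  # roughly 0.1
```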
The phase I am at is coming up with those axes of failure for LLM agents. To do so effectively, we need a clearer conversation about the applications LLMs will actually be useful for. With that, we need to be able to describe the ways LLM agents can fail and the impacts of those failures. This two-horse dance may have been characterizable if models were static -- i.e. predicting whether failure modes will occur across applications, which I think they will -- but as even the most popular models are changed without version control, we're flying in headfirst.
This will quickly become another area where the taxonomy used by the people driving the technical field is divorced from the general discussion at hand. Exploration is a fairly well-understood term in the RL field. In the context of language models, exploration is almost never discussed. These systems were released to the world in a form where the scope of their use is defined narrowly, so changing that scope is intellectually tricky for most users to track.
In this vein, I'm not so worried about the technical constraints needed to do this; it just seems like enabling LLMs to take action is going to charge ahead into an even stranger discourse around these technologies.
Exploration and RLHF
As reinforcement learning from human feedback (RLHF) is nearly table stakes these days for making a compelling chatbot, it's surprising that exploration isn't discussed more -- it is one of the core topics in the field of RL. Exploration is the question of how an agent gathers new data to improve its policy. In RLHF, exploration seems to be handled primarily within the training process, where the policy changes the data it encounters and how it will generate in the future. While the exploration vs. exploitation tradeoff can never be removed from the training of an RL algorithm (there are core hyperparameters that dictate it), the evaluations and training runs I've seen do not take the exploration framing.
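For a sense of where those hyperparameters live, here is a minimal sketch of the per-sample objective many RLHF setups optimize with PPO; the names and values below are illustrative assumptions, not taken from any specific codebase:

```python
# Sketch of where exploration hides in today's RLHF setups. The knobs that
# actually govern it are the sampling temperature used to generate rollouts
# and the KL penalty keeping the policy near the reference model.
# Names and numbers are illustrative.

def rlhf_reward(rm_score: float,
                logprob_policy: float,
                logprob_reference: float,
                kl_coef: float = 0.1) -> float:
    """Per-sample reward of the kind typically optimized with PPO in RLHF."""
    kl_penalty = logprob_policy - logprob_reference  # sample-based KL estimate
    return rm_score - kl_coef * kl_penalty

# Exploration enters only through how completions are sampled...
sampling_temperature = 1.0   # higher -> more diverse rollouts
# ...and through how far the policy is allowed to drift from the reference.
kl_coef = 0.1                # lower -> more drift allowed

print(rlhf_reward(rm_score=0.8, logprob_policy=-1.2,
                  logprob_reference=-1.5, kl_coef=kl_coef))
```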
At its core, exploration is required to gather the new data needed to solve a task. With how RLHF is done today, which is primarily about optimizing against an aggregate reward model trained on preference data, this absence makes sense. The RL in RLHF has a much narrower framing than the intellectual foundations of RL would let on. RL research on exploration centers on creating intrinsic rewards, optimizing information gain, skill learning / diversity, and more.
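For contrast, here is a minimal sketch of the kind of exploration machinery RL research studies and RLHF mostly ignores -- a count-based intrinsic bonus added to the task reward (the state representation and bonus schedule are illustrative choices):

```python
# Count-based exploration: novel states earn an intrinsic bonus on top of the
# extrinsic task reward, and the bonus decays as states become familiar.
from collections import Counter
import math

visit_counts: Counter = Counter()

def reward_with_exploration_bonus(state: str, extrinsic_reward: float,
                                  bonus_coef: float = 0.5) -> float:
    visit_counts[state] += 1
    intrinsic = bonus_coef / math.sqrt(visit_counts[state])  # decays with familiarity
    return extrinsic_reward + intrinsic

# The same state visited repeatedly contributes a shrinking bonus:
for step in range(3):
    print(reward_with_exploration_bonus("same prompt, same tool call",
                                        extrinsic_reward=0.0))
```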
I suspect this gap will start to close in experiments where exploration is actively used to complete tasks, in addition to optimizing against human preference labels. There, giving your LLM the ability to act, and optimizing the full loop, seems crucial.
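A rough sketch of what that full loop could look like, with `policy`, `environment`, and `reward_model` as stand-in stubs rather than any real API:

```python
# The agent both acts in an environment (task-success signal) and is scored by
# a learned preference/reward model; the policy update sees both signals.
# Everything below is a placeholder stub for illustration only.
import random

random.seed(0)

def policy(observation: str) -> str:
    # Stub policy: a real system would sample an action from the LLM.
    return random.choice(["search the web", "write code", "ask the user"])

def environment(action: str) -> tuple[str, float]:
    # Returns the next observation and a sparse task-success reward.
    success = 1.0 if action == "write code" else 0.0
    return f"result of '{action}'", success

def reward_model(observation: str, action: str) -> float:
    # Stand-in for a preference-trained reward model's scalar score.
    return 0.2 if "user" in action else 0.1

def update_policy(trajectory: list[tuple[str, str, float]]) -> None:
    # Placeholder for a policy-gradient / PPO update over the trajectory.
    pass

obs = "task: fix the failing unit test"
trajectory = []
for _ in range(5):
    action = policy(obs)
    next_obs, task_reward = environment(action)
    total_reward = task_reward + reward_model(obs, action)
    trajectory.append((obs, action, total_reward))
    obs = next_obs
update_policy(trajectory)
print(trajectory)
```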