RLHF learning resources in 2024

A list for beginners and wannabe experts and everyone in between.

Jan 12, 2024

I’ve put a lot of effort into sharing information on Reinforcement Learning from Human Feedback (RLHF). I figured I would collect those resources in one place for people who come to me or Interconnects looking to learn about the topic.

This was inspired by my recent appearance on Latent Space, in an episode we called RLHF 201. Doing this made me realize, once again, how few resources there are for going deeper on RLHF other than often-confusing research papers. The slides for this talk are available here. Compared to my last lecture, I added a bunch of the underlying math, made the figures cleaner, and added commentary on evaluation. The previous generation of slides, which I used at Stanford, is also good and has a longer introduction.

Generally, the goal for this post is to give people with different learning styles the tools to learn more in their way of choice. I’ve split it up by video mediums (talks and podcasts), technical mediums (code and models or datasets), and text (which is mostly blog posts). Almost all of these link to papers within them, if you’re looking to go into more detail.

This list is obviously biased towards my stuff and is not a review, so plenty of things I’ve seen aren’t included. It’s meant to give entry points for people wishing to go deeper on the subject. If you send me things that you think should be added and why, I’ll happily take a look.

Generally, I’ll give a very light description as to why I like every piece of content.

Video

Tutorials and overviews

December 2022, Reinforcement Learning from Human Feedback: From Zero to chatGPT. This was my first big lecture on RLHF. Included for posterity and remembering the excitement of the earliest days.
April 2023, John Schulman’s Berkeley RLHF talk, “RL and Truthfulness.” This is still the best talk on intuitions of why RLHF tuning on outputs from other models is risky from a capabilities point of view.
July 2023, My tutorial at ICML. A solid introduction with an hour on data from my colleague at Toloka AI.
July 2023, John Schulman’s Proxy Objectives in RLHF at ICML. This talk has some revealing details on the outer loop training they do for ChatGPT.

Research talks of mine

March 2023: Reinforcement Learning from Human Feedback: Open and Academic Perspectives (slides): A decent introduction with advice on how academics can work in this area.
August 2023: Objective Mismatch in Reinforcement Learning from Human Feedback (slides): understand the fundamental tradeoffs and sources of uncertainty in RLHF.
November 2023: Bridging RLHF from LLMs back to control (slides). Make connections on current RLHF progress back to other fields that have been using RL for much longer!
December 2023: 15min History of Reinforcement Learning and Human Feedback (slides). Answers: what are the core motivating fields of RLHF?
December 2023: Direct Preference Optimization (DPO): Easy to start, hard to master (maybe) (slides). Get enough info to know that we won’t have a DPO answer in 2024.
Videos from the New Orleans Alignment Workshop (before NeurIPS) include a bunch of appealing talks. Anca’s was specifically recommended to me.

Other podcasts

January 2023: TWIML Reinforcement Learning - RLHF, Robotic Pre-Training & Offline RL with Sergey Levine. This holds up really well when thinking about integrating long-term focuses of RL research into RLHF methods.
September 2023: Generating Conversation: RLHF and LLM Evaluations with Nathan Lambert (Episode 6). I still think this was one of my better podcast appearances of 2023!
January 2024: RLHF 201 - with Nathan Lambert of AI2 and Interconnects (video with slides here). In this, we discuss all the core topics in RLHF as we get ready for 2024. It was a great time to make this, and I think there are a lot of details to learn from it if you have the basics.

Research

The iteratively updated list of papers I come across in the area is here (which I want to update soon). It’s the basis for this series, which I intend to continue.

I wrote two position / survey papers last fall covering what I expect to be the core themes unfolding in RLHF in the next few years. If you want a deeper take, I wholeheartedly recommend them.

On reward models, the limitations of preferences, and more: The History and Risks of Reinforcement Learning and Human Feedback.
On the fundamental tradeoffs of different RLHF pieces: The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback.

There are two surveys of the area worth looking at too.

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback serves as a critique of the RLHF perspective from a mostly AI Safety angle and with a focus on LLM techniques.
A Survey of Reinforcement Learning from Human Feedback covers a much broader base than most of the papers I’ve linked. It’s important to remember that RLHF is much bigger than just LLMs.

The further reading section of my first primary blog post on RLHF is a good place to start with the classics of the field, with the likes of InstructGPT, Anthropic’s work, etc. It’s quoted here:

Fine-Tuning Language Models from Human Preferences (Ziegler et al. 2019): An early paper that studies the impact of reward learning on four specific tasks.
Learning to summarize with human feedback (Stiennon et al., 2020): RLHF applied to the task of summarizing text. Also, Recursively Summarizing Books with Human Feedback (OpenAI Alignment Team 2021), follow-on work summarizing books.
WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 2021): Using RLHF to train an agent to navigate the web.
InstructGPT: Training language models to follow instructions with human feedback (OpenAI Alignment Team 2022): RLHF applied to a general language model [Blog post on InstructGPT].
GopherCite: Teaching language models to support answers with verified quotes (Menick et al. 2022): Train a LM with RLHF to return answers with specific citations.
Sparrow: Improving alignment of dialogue agents via targeted human judgements (Glaese et al. 2022): Fine-tuning a dialogue agent with RLHF.
ChatGPT: Optimizing Language Models for Dialogue (OpenAI 2022): Training a LM with RLHF for suitable use as an all-purpose chat bot.
Scaling Laws for Reward Model Overoptimization (Gao et al. 2022): studies the scaling properties of the learned preference model in RLHF.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic, 2022): A detailed documentation of training a LM assistant with RLHF.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al. 2022): A detailed documentation of efforts to “discover, measure, and attempt to reduce [language models] potentially harmful outputs.”
Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning (Cohen et al. 2022): Using RL to enhance the conversational skill of an open-ended dialogue agent.
Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization (Ramamurthy and Ammanabrolu et al. 2022): Discusses the design space of open-source tools in RLHF and proposes a new algorithm NLPO (Natural Language Policy Optimization) as an alternative to PPO.
Llama 2 (Touvron et al. 2023): Impactful open-access model with substantial RLHF details.

Code

There’s a lot of code out there for RLHF. Not all of it is that easy to work with or learn from. I worked on the first two.

Alignment handbook is probably the cleanest to start with and build off of from a researcher’s point of view.
TRL is usually the fastest place to get minimal implementations of all the new algorithms. It has lots of examples, most of which can be run on a single GPU.
DeepSpeed Chat (paper is here). While it is a very different engineering setup, it is good to compare different ways of implementing the same stuff.
TRLX, while kind of no longer supported, has some of the most detailed logs on scaling algorithms like PPO.

Models

Obviously there are way too many to do a thorough study of, but the most important open RLHF models and datasets of the last year to me are:

Zephyr (led to Tulu 2, Stability’s model, Intel’s model, and more) was the spark that gave us the proliferation of DPO and generally useful RLHF models.
Starling was a recent model with great performance that intriguingly did not use DPO.
Llama 2 still has more RLHF details in its paper than most labs have attempted to share.

Datasets

UltraFeedback: the dataset that gave us Zephyr et al. There’s even been more research trying to improve the dataset and RLHF performance.
Open Assistant 1: The community-generated instruction data that yielded the first wave of progress in open IFT training.
Alpaca: The first popular synthetic instruction data.
ShareGPT and variants: large datasets people are using to try and get ChatGPT-like abilities in open data.

Evaluations

These three evaluations are the core set used to rank RLHF models relative to each other.

ChatBotArena: The crowd-sourced comparisons website that is the go-to source of model quality for open and closed models alike.
MT Bench: A two-turn chat evaluation also built by LMSYS, which is very well correlated with most real-world evaluations of LLMs.
AlpacaEval: The first GPT4-as-a-judge tool to proliferate LLM-as-a-judge practices.

Blog posts

Interconnects posts

From the 2023 year in review post:

Feb. 27: The RLHF battle lines are drawn covers the importance of RLHF to the LLM ecosystem, the costs of building it, and where the year will take us.
Apr. 26: Beyond human data: RLAIF needs a rebrand covers a new way of thinking about general RL fine-tuning of LLMs: RL from computational feedback (RLCF). RLAIF is a variant of this.
Jun. 21: How RLHF actually works covers the high-level intuition about what RLHF changes in model behavior -- safety, formatting, reasoning, and more subtle things.
Aug. 2: Specifying objectives in RLHF covers the proxy objective problem in RLHF and why the new method Direct Preference Optimization (DPO) may not be the final solution.
Oct. 18: Undoing RLHF and the brittleness of safe LLMs covers why RLHF safety filters are not resistant during further training and how this shifts the LLM marketplace.
Oct. 25: RLHF lit. review #1 and missing pieces in RLHF covers recent papers and core themes of RL research not yet touched by RLHF.
Nov. 22: RLHF progress: Scaling DPO to 70B, DPO vs PPO update, Tülu 2, Zephyr-β, meaningful evaluation, data contamination covers empirical progress in RLHF in the second half of 2023.
Dec. 6: Do we need RL for RLHF? covers all things DPO and what it means for RLHF in the future.

And this year:

What is missing to reproduce the RLHF of GPT4? The problems we likely won’t solve in Open RLHF this year.
Multimodal RLHF roundup: The questions you should try and answer if you want to work on multimodal chat models.

Other blogs of mine

Illustrating RLHF: The original post I learned the topic with, still a good introduction.
StackLLaMA: A hands-on guide to train LLaMA with RLHF: The full RLHF process on a specific dataset and domain.
Red-Teaming Large Language Models: A general introduction to red-teaming.
What Makes a Dialog Agent Useful?: A general introduction to the difference between chat agents and instruction models.

Other resources

Things like awesome-rlhf on GitHub have a lot of links, but they’re not curated.
The Assembly AI post How RLHF Preference Model Tuning Works (And How Things May Go Wrong) from Swyx.
Chip Huyen’s post on RLHF (recommended by multiple people).
Karpathy’s “State of GPT” section about reward models (slides).
N implementation details of RLHF goes into the weeds trying to reproduce some of OpenAI’s original results in the area.

Please send me any other links you think deserve a chance to be included. I’m happy to keep updating this for a few weeks!