Do we need RL for RLHF?

Dec 6, 2023

Direct (DPO) vs. RL methods for preferences, more RLHF models, and hard truths in open RLHF work. We have more questions than answers.

3 Comments

Mark

Dec 6, 2023

I'm seeing a lot of attempts by startups to create enterprise products out of cutting-edge research like this. But they're not always closely scrutinized b/c they keep their work so close to their chest.

Write-ups like these are incredibly insightful, especially the question section. Really keeps you grounded and shows how convoluted breakthroughs can be.

Teng Xiao

Dec 28, 2023

Hi Dr. Lambert, are you aware of any papers or other research that empirically demonstrates PPO's superiority over DPO on certain datasets or tasks?

Nathan Lambert

Dec 28, 2023

That's the problem: it's mostly behind the closed doors of big companies. Hoping to improve that in the new year (people who have expressed interest: AI2, Stanford, Nvidia, and anyone who wants to help).
