How does KTO (Kahneman-Tversky Optimization) factor in? It seems to me more adaptable to real-world data, and on the surface it looks able to learn a more nuanced reward model.
I am waiting for more information. My sense is it'll be very useful for tasks where just a thumbs-up on an answer can be collected at scale. When I know more, I will definitely write about it. It feels like DPO before the first model was trained on it (Zephyr).
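For concreteness, the appeal of that thumbs-up/thumbs-down setting is that the loss only needs a binary label per completion, not a pairwise comparison. Here is a minimal sketch of a KTO-style objective, simplified for illustration: the beta and lambda values and the batch-mean reference point are stand-ins, not the exact recipe from the paper.

```python
import torch

def kto_style_loss(policy_logps, ref_logps, labels, beta=0.1,
                   lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO-style loss on binary (thumbs-up / thumbs-down) feedback.

    policy_logps, ref_logps: summed log-probs of each completion under the
        policy and the frozen reference model, shape (batch,).
    labels: 1.0 for desirable (thumbs up), 0.0 for undesirable (thumbs down).
    """
    # Implicit reward: scaled log-ratio between policy and reference.
    rewards = beta * (policy_logps - ref_logps)
    # Reference point: the paper estimates a KL term over the batch; the
    # detached batch mean of the rewards is used here as a simple stand-in.
    z_ref = rewards.mean().detach()
    desirable = labels.float()
    # Desirable examples are pushed above the reference point, undesirable
    # ones below it; the sigmoid gives the saturating value-function shape.
    value = desirable * lambda_d * torch.sigmoid(rewards - z_ref) \
          + (1 - desirable) * lambda_u * torch.sigmoid(z_ref - rewards)
    weight = desirable * lambda_d + (1 - desirable) * lambda_u
    return (weight - value).mean()
```

The asymmetric treatment of desirable versus undesirable examples around a reference point is where the Kahneman-Tversky framing comes in.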
But yeah, I don't think they really result in a reward model, right? The same topics will apply, just with a different inference mapping.
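To make the "different inference mapping" concrete: DPO never produces a separate reward network; it only implies a reward as a scaled log-ratio between the trained policy and the frozen reference. A rough sketch, assuming summed per-completion log-probs as inputs:

```python
import torch
import torch.nn.functional as F

def dpo_implicit_reward(policy_logps, ref_logps, beta=0.1):
    # The "reward" DPO implicitly defines: a scaled log-ratio between the
    # trained policy and the frozen reference model. No standalone reward
    # model is trained; at inference time you simply sample from the policy.
    return beta * (policy_logps - ref_logps)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO pairwise loss, written in terms of the implicit rewards.
    r_chosen = dpo_implicit_reward(policy_chosen, ref_chosen, beta)
    r_rejected = dpo_implicit_reward(policy_rejected, ref_rejected, beta)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```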
@Nathan - There is a paper that I recently read about using per-token rewards - "Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss". Maybe you have looked at this. And the other paper I am currently reading is LiPO, where we have a ranked list of human-preferred choices.
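For readers following along, here is roughly what moving from pairs to a ranked list looks like. This is a generic Plackett-Luce / ListMLE-style listwise objective, not the exact LiPO-lambda loss; the per-completion scores are assumed to be something like the beta-scaled policy/reference log-ratio from the sketches above.

```python
import torch

def listwise_preference_loss(scores):
    """Negative Plackett-Luce log-likelihood of a ranked list of completions.

    scores: tensor of shape (batch, K), one scalar score per completion,
        ordered from most preferred (index 0) to least preferred (index K-1).
    """
    # log P(ranking) = sum_k [ s_k - logsumexp(s_k, ..., s_{K-1}) ]
    rev = torch.flip(scores, dims=[-1])        # least- to most-preferred order
    denom = torch.logcumsumexp(rev, dim=-1)    # running log-partition terms
    log_lik = (rev - denom).sum(dim=-1)        # Plackett-Luce log-likelihood
    return -log_lik.mean()
```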
It’s on my list! There are a few papers in the area. I need to check out this one that does something similar. https://arxiv.org/abs/2402.00782