4 Comments

How does KTO (Kahneman-Tversky Optimization) factor in? It seems more adaptable to real-world data, and on the surface it looks like it could learn a more nuanced reward model.
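For context, a rough sketch of the KTO objective (my paraphrase of the KTO paper by Ethayarajh et al.; the exact reference-point estimate and default weights are in the paper). The property relevant to the comment is that each example only needs a binary desirable/undesirable label rather than a paired comparison, which is what makes it easier to apply to real-world feedback:

$$
r_\theta(x, y) = \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},\qquad
v(x, y) =
\begin{cases}
\lambda_D\,\sigma\big(\beta\,(r_\theta(x, y) - z_{\mathrm{ref}})\big) & \text{if } y \text{ is desirable}\\
\lambda_U\,\sigma\big(\beta\,(z_{\mathrm{ref}} - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable}
\end{cases}
$$

$$
\mathcal{L}_{\mathrm{KTO}}(\pi_\theta; \pi_{\mathrm{ref}}) = \mathbb{E}_{(x, y)\sim\mathcal{D}}\big[\lambda_y - v(x, y)\big],
$$

where $z_{\mathrm{ref}}$ is a reference point (estimated via a KL term between $\pi_\theta$ and $\pi_{\mathrm{ref}}$) and $\lambda_D, \lambda_U$ weight gains and losses asymmetrically, which is the Kahneman-Tversky part.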


@Nathan - There is a paper I recently read about using per-token rewards, "Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss." Maybe you have looked at this. The other paper I am currently reading is LiPO, which works from a ranked list of human-preferred responses.
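As a reference for the listwise idea, here is a generic Plackett-Luce / list-MLE style objective over a human-ranked list $y_1 \succ \dots \succ y_K$, using the DPO-style implicit reward as the score. This is an illustrative sketch of the listwise setting LiPO studies, not the exact LiPO-$\lambda$ loss from the paper:

$$
s_\theta(x, y) = \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},\qquad
\mathcal{L}_{\mathrm{list}} = -\,\mathbb{E}_{(x,\, y_1 \succ \dots \succ y_K)}\left[\sum_{k=1}^{K}\log\frac{\exp\big(s_\theta(x, y_k)\big)}{\sum_{j=k}^{K}\exp\big(s_\theta(x, y_j)\big)}\right].
$$

The pairwise case $K = 2$ reduces to a DPO-style sigmoid loss, which is why ranked lists are a natural generalization of pairwise preference data.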
