How does KTO (Kahneman-Tversky Optimization) factor in? It seems to me more adaptable to real-world data, and on the surface it looks able to learn a more nuanced reward model.
I am waiting for more information. My sense is it'll be very useful for tasks where just a thumbs-up on an answer can be collected at scale. When I know more, I will definitely write about it. It feels like DPO before the first model was trained on it (Zephyr).
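For concreteness, the appeal of that thumbs-up/thumbs-down setting is that the loss only needs a binary label per completion, not a pairwise comparison. Here is a minimal sketch of a KTO-style objective, simplified for illustration: the beta and lambda values and the batch-mean reference point are stand-ins, not the exact recipe from the paper.

```python
import torch

def kto_style_loss(policy_logps, ref_logps, labels, beta=0.1,
                   lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO-style loss on binary (thumbs-up / thumbs-down) feedback.

    policy_logps, ref_logps: summed log-probs of each completion under the
        policy and the frozen reference model, shape (batch,).
    labels: 1.0 for desirable (thumbs up), 0.0 for undesirable (thumbs down).
    """
    # Implicit reward: scaled log-ratio between policy and reference.
    rewards = beta * (policy_logps - ref_logps)
    # Reference point: the paper estimates a KL term over the batch; the
    # detached batch mean of the rewards is used here as a simple stand-in.
    z_ref = rewards.mean().detach()
    desirable = labels.float()
    # Desirable examples are pushed above the reference point, undesirable
    # ones below it; the sigmoid gives the saturating value-function shape.
    value = desirable * lambda_d * torch.sigmoid(rewards - z_ref) \
          + (1 - desirable) * lambda_u * torch.sigmoid(z_ref - rewards)
    weight = desirable * lambda_d + (1 - desirable) * lambda_u
    return (weight - value).mean()
```

The asymmetric treatment of desirable versus undesirable examples around a reference point is where the Kahneman-Tversky framing comes in.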
But yeah, I don't think they really result in a reward model, right? The same topics will apply, just with a different inference mapping.
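To make the "different inference mapping" concrete: DPO never produces a separate reward network; it only implies a reward as a scaled log-ratio between the trained policy and the frozen reference. A rough sketch, assuming summed per-completion log-probs as inputs:

```python
import torch
import torch.nn.functional as F

def dpo_implicit_reward(policy_logps, ref_logps, beta=0.1):
    # The "reward" DPO implicitly defines: a scaled log-ratio between the
    # trained policy and the frozen reference model. No standalone reward
    # model is trained; at inference time you simply sample from the policy.
    return beta * (policy_logps - ref_logps)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO pairwise loss, written in terms of the implicit rewards.
    r_chosen = dpo_implicit_reward(policy_chosen, ref_chosen, beta)
    r_rejected = dpo_implicit_reward(policy_rejected, ref_rejected, beta)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```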
@Nathan - There is a paper that I recently read about using per-token rewards - "Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss". Maybe you have looked at this. And the other paper I am currently reading is LiPO, where we have a ranked list of human-preferred choices.
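For readers following along, here is roughly what moving from pairs to a ranked list looks like. This is a generic Plackett-Luce / ListMLE-style listwise objective, not the exact LiPO-lambda loss; the per-completion scores are assumed to be something like the beta-scaled policy/reference log-ratio from the sketches above.

```python
import torch

def listwise_preference_loss(scores):
    """Negative Plackett-Luce log-likelihood of a ranked list of completions.

    scores: tensor of shape (batch, K), one scalar score per completion,
        ordered from most preferred (index 0) to least preferred (index K-1).
    """
    # log P(ranking) = sum_k [ s_k - logsumexp(s_k, ..., s_{K-1}) ]
    rev = torch.flip(scores, dims=[-1])        # least- to most-preferred order
    denom = torch.logcumsumexp(rev, dim=-1)    # running log-partition terms
    log_lik = (rev - denom).sum(dim=-1)        # Plackett-Luce log-likelihood
    return -log_lik.mean()
```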
It’s on my list! There are a few papers in the area. I need to check out this one that does something similar. https://arxiv.org/abs/2402.00782