2 Comments
Christo Wilken

Thank you for the insightful post!

Could you perhaps explain why “If doing RL with verifiable rewards (RLVR) on an instruct model, the KL penalty could still be helpful.”?

Nathan Lambert

Ah, essentially if you want to use RL to improve performance but not have the model develop a super long CoT/reasoning behavior, you probably want the KL penalty for now. This is what we did for TÜLU 3. Though I think KL penalties may go away with time.
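To make the mechanism concrete, here is a minimal sketch of how a KL penalty against a frozen reference (instruct) model is commonly folded into the reward in RLHF-style training. The function name, tensor shapes, estimator choice (the simple log-ratio "k1" estimator), and coefficient value are illustrative assumptions, not the TÜLU 3 implementation.

```python
import torch


def kl_penalized_rewards(
    reward: torch.Tensor,           # (batch,) verifiable reward, e.g. 1.0 if the answer checks out
    policy_logprobs: torch.Tensor,  # (batch, seq) log-probs of sampled tokens under the current policy
    ref_logprobs: torch.Tensor,     # (batch, seq) log-probs of the same tokens under the frozen instruct model
    kl_coef: float = 0.05,          # hypothetical coefficient; tuned per setup in practice
) -> torch.Tensor:
    """Subtract a sequence-level KL estimate from the verifiable reward.

    Penalizing log(pi / pi_ref) keeps the policy close to the instruct model,
    which discourages drift toward very long reasoning traces when the
    verifiable reward alone would permit them.
    """
    per_token_kl = policy_logprobs - ref_logprobs  # (batch, seq) k1 estimator of KL per token
    seq_kl = per_token_kl.sum(dim=-1)              # (batch,) summed over the generated tokens
    return reward - kl_coef * seq_kl
```

The design intuition matches the comment above: the verifiable reward only cares about the final answer, so the KL term is what anchors the policy's style and length to the instruct model.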
