Thank you for the insightful post!
Could you perhaps explain why “If doing RL with verifiable rewards (RLVR) on an instruct model, the KL penalty could still be helpful.”?
Ah, essentially: if you want to use RL to improve performance without making the model develop super long CoT/reasoning behavior, you probably want the KL penalty for now. This is what we did for TÜLU 3. That said, I think KL penalties may go away with time.
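For concreteness, here's a minimal sketch of how a KL penalty is commonly folded into the RL reward: the per-token log-prob gap between the policy and a frozen reference model is summed and subtracted from the task reward, discouraging the policy from drifting into long, off-distribution generations. The function name and the `beta` value are illustrative assumptions, not the actual TÜLU 3 implementation.

```python
def kl_penalized_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Subtract a KL penalty from the (verifiable) task reward.

    policy_logprobs / ref_logprobs: per-token log-probs of the sampled
    completion under the current policy and the frozen reference model.
    Uses the common single-sample KL estimate: sum(log pi - log pi_ref).
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate

# Toy example: the policy has drifted slightly above the reference,
# so the KL estimate is positive and the reward is discounted.
r = kl_penalized_reward(1.0, [-0.5, -1.0], [-0.6, -1.2], beta=0.1)
```

Setting `beta=0` recovers plain RLVR with no penalty, which is what lets the CoT grow unboundedly long.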