Discussion about this post

Christo Wilken

Thank you for the insightful post!

Could you perhaps explain why “If doing RL with verifiable rewards (RLVR) on an instruct model, the KL penalty could still be helpful.”?
