Thank you for the insightful post!
Could you perhaps explain why “If doing RL with verifiable rewards (RLVR) on an instruct model, the KL penalty could still be helpful.”?
Ah, essentially: if you want to use RL to improve performance without making the model develop super long CoT/reasoning behavior, you probably want the KL penalty for now. This is what we did for TÜLU 3. That said, I think KL penalties may go away with time.
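For concreteness, here's a minimal sketch of how a KL penalty is commonly folded into the RL reward: the per-token log-prob gap between the policy and a frozen reference model is summed and subtracted from the task reward, discouraging the policy from drifting into long, off-distribution generations. The function name and the `beta` value are illustrative assumptions, not the actual TÜLU 3 implementation.

```python
def kl_penalized_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Subtract a KL penalty from the (verifiable) task reward.

    policy_logprobs / ref_logprobs: per-token log-probs of the sampled
    completion under the current policy and the frozen reference model.
    Uses the common single-sample KL estimate: sum(log pi - log pi_ref).
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate

# Toy example: the policy has drifted slightly above the reference,
# so the KL estimate is positive and the reward is discounted.
r = kl_penalized_reward(1.0, [-0.5, -1.0], [-0.6, -1.2], beta=0.1)
```

Setting `beta=0` recovers plain RLVR with no penalty, which is what lets the CoT grow unboundedly long.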