Discussion about this post

Ram Komarraju:

Nathan, IIRC, you're quite bullish on the long-term prospects of RL as applied to LLMs. But what do you think of the findings from last week's paper, "Reasoning with Sampling: Your Base Model is Smarter Than You Think," which seem to further confirm the results of the pass@k paper from earlier in the year? Is the only advantage of RLVR improved one-shot performance in exchange for a loss of diversity? And is even that only applicable to verifiable scenarios?

Thanks
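
For context on the metric in question: pass@k is usually reported with the unbiased estimator from the Codex paper (Chen et al., 2021), which estimates, per problem, the probability that at least one of k samples drawn from n generations (c of them correct) succeeds. A minimal sketch, assuming NumPy:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that were correct
    k: evaluation budget
    Returns 1 - C(n-c, k) / C(n, k): the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-draw
        # must contain a correct one.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

This is the quantity behind the tradeoff the question raises: RLVR-trained models tend to improve pass@1, while a broadly sampled base model can close the gap at large k.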

Rainbow Roxy:

Thanks for writing this, it clarifies a lot. I totally agree that the challenges in scaling RL, especially for academics, really mirror the engineering headaches seen with MoE models. Your insight about predicting the learning curve is spot on; it's such an important 'hill' to understand.
