2 Comments
juluc:

Hi Nathan, great read, a few Qs:

When scaling from 10K to 1M+ token episodes, are labs experimenting with any form of "semantic coherence tracking" during training? With rewards that sparse, it seems like you'd want some signal that the model's reasoning is staying conceptually consistent rather than just wandering until it stumbles on a solution (this connects to your last blog post).

Or is this why they're decomposing into subtasks: because tracking and rewarding coherence across million-token trajectories would be computationally too expensive (especially compared to just training on more samples)?

If it's not compute:

The fact that current methods are 'helping models get more robust at individual tasks' rather than doing true end-to-end learning suggests the real bottleneck probably isn't just sparse rewards but maintaining a meaningful learning signal across such long contexts. Or do the labs simply want users to build their own scaffolds, and are optimizing models for benchmark performance instead?
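To make the first question concrete, here's a minimal sketch of what a coherence signal could look like as a small shaping term on top of the sparse terminal reward. The chunking, the `embed` stand-in, and the `coherence_weight` are all hypothetical choices for illustration, not a claim about what any lab actually runs:

```python
# Hypothetical sketch: a dense "coherence" shaping term added to a sparse
# terminal reward. All names and weights here are illustrative assumptions.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real sentence embedder; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def coherence_bonus(chunks: list[str]) -> float:
    """Mean cosine similarity between consecutive reasoning chunks.
    Higher = the trajectory stays on one conceptual thread."""
    if len(chunks) < 2:
        return 0.0
    sims = [float(embed(a) @ embed(b)) for a, b in zip(chunks, chunks[1:])]
    return float(np.mean(sims))

def shaped_return(terminal_reward: float, chunks: list[str],
                  coherence_weight: float = 0.05) -> float:
    """Sparse outcome reward plus a small coherence shaping term.
    The weight has to stay small so the policy can't farm coherence
    while ignoring the actual task."""
    return terminal_reward + coherence_weight * coherence_bonus(chunks)
```

The obvious failure mode is reward hacking: a model can look maximally "coherent" by repeating itself, which may be part of why subtask decomposition seems more attractive than a trajectory-wide coherence score.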

Nathan Lambert:

Honestly, I don't know exactly what they're doing. The SemiAnalysis post has more hypotheticals. I think there will necessarily be more intermediate supervision, because human use of the model passes through intermediate steps in other scenarios anyway. We'll see though (this comment didn't make the most sense as I wrote it).
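For what intermediate supervision might mean mechanically, here's a rough sketch under stated assumptions: the episode is split at subtask boundaries and a placeholder verifier scores each step, so credit assignment doesn't have to span the whole million-token trajectory. The `verify_step` function and the subtask split are hypothetical stand-ins:

```python
# Hypothetical sketch of intermediate supervision: per-subtask rewards from a
# verifier instead of a single reward at the end of a very long episode.
# The subtask split and the verifier are placeholder assumptions.

def verify_step(subtask_output: str) -> float:
    """Placeholder verifier (in practice: unit tests, a reward model, etc.)."""
    return 1.0 if subtask_output.strip() else 0.0

def process_rewards(subtask_outputs: list[str],
                    terminal_reward: float) -> list[float]:
    """Dense per-step rewards, with the sparse outcome reward credited to
    the final step. This keeps credit assignment local to each subtask."""
    if not subtask_outputs:
        return [terminal_reward]
    rewards = [verify_step(out) for out in subtask_outputs]
    rewards[-1] += terminal_reward
    return rewards
```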

It'll be telling which labs want others to build agents off their models (the platform play), but I bet OpenAI especially wants to own the end-to-end experience.
