Reasoning further pushes these models’ ability to generalize. Totally agree there. There’s still nothing here that helps the LLM wander outside its existing distribution/vector space, though, right?
The distinction I’m making is one that’s more obvious in an RL agent. One can argue that it’s still highly dependent on your cost/reward shaping, but you can have an agent learn to handle entirely novel situations. Often that “novelty” is quite simplistic, depending on the case, but the problem is ultimately formulated quite differently.
An LLM, even with CoT, can refine its answers into exactly the right part of the vector space. It also has a huge knowledge base, which makes its responses to most queries quite good, especially if it can zero in on the right part of it. However, there is no ability to actually go “out of sample” entirely, simply because of the conceptual way it operates. Is this understanding correct, or are you saying it generalizes beyond that?
Yeah I agree. I am very much thinking of core text domains as a measure of progress. There's so little real progress in agents that it's hard to forecast.
Great, makes sense. You wrote very clearly, but I wanted to explicitly check my understanding, especially since there is so much published these days that it’s hard to keep track.
It’s a big fat mess out here.
Thanks for this post. I find myself often wondering this exact question of how far reasoners will go in other domains. I was not aware Claude Sonnet 3.5 was interpolating between shorter (or even no?) chains of thought and longer ones. If that works reliably, it seems like an obvious way to navigate the jagged frontier for maximum performance while controlling costs on tasks that don’t require reasoning (I sketch what I mean below). So obvious that I wonder if there is a flaw I am missing that explains why no one else does this.
I am also always wondering what role government research funding can play in leading-edge problems. It seems like there are rich opportunities to develop other verifiers and to provide evidence that LLM-as-a-judge works in particular domains.
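To make the interpolation idea from my first paragraph concrete, here is a minimal sketch: route each query to a thinking budget based on an estimated difficulty. Everything here is a made-up placeholder (estimate_difficulty, max_thinking_tokens), not any vendor's actual API.

```python
# Minimal sketch of budget interpolation: spend thinking tokens only
# where a difficulty estimate says they are likely to pay off.
# `model.generate` and `max_thinking_tokens` are hypothetical stand-ins.

def answer(query, model, estimate_difficulty):
    difficulty = estimate_difficulty(query)  # e.g. a small classifier, score in [0, 1]
    if difficulty < 0.3:
        budget = 0        # easy query: answer directly, no chain of thought
    elif difficulty < 0.7:
        budget = 1024     # medium query: short chain of thought
    else:
        budget = 8192     # hard query: long chain of thought
    return model.generate(query, max_thinking_tokens=budget)
```

The thresholds themselves would need tuning against whatever cost/accuracy trade-off you care about, which is maybe where the hidden difficulty lies.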
Great post as always! I think the new paradigm could be extended in ways that surprise us. For example, could we mine web-scale corpora for question-like text passages followed by answer-like text, and use such pairings with a judge reward and RL? Could we do image caption prediction for scientific images in encyclopedias this way? Humans are far more data efficient at learning from such data, and RL might be a way to help close the gap.
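Concretely, I imagine the mining step and the judge reward looking roughly like this. It is only a sketch, and every name here (passages, policy, judge) is a hypothetical placeholder rather than a real pipeline:

```python
import re

# Rough sketch: mine question-like/answer-like passage pairs from a
# corpus, then use an LLM judge as a scalar reward for RL. All names
# (passages, policy, judge) are hypothetical placeholders.

QUESTION_LIKE = re.compile(r"\?\s*$")

def mine_qa_pairs(passages):
    """Pair each passage ending in a question mark with the passage after it."""
    return [
        (q.strip(), a.strip())
        for q, a in zip(passages, passages[1:])
        if QUESTION_LIKE.search(q.strip())
    ]

def judge_reward(question, mined_answer, candidate, judge):
    """Ask an LLM judge how well the candidate matches the mined answer (0 to 1)."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {mined_answer}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with a single score from 0 to 1 for correctness."
    )
    return float(judge(prompt))

# Training loop outline (policy-gradient flavored; update rule omitted):
# for question, mined_answer in mine_qa_pairs(passages):
#     candidate = policy.generate(question)
#     reward = judge_reward(question, mined_answer, candidate, judge)
#     policy.update(question, candidate, reward)
```

The crude pairing heuristic is sort of the point: the judge reward, not the mining filter, would carry most of the quality burden.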
Totally agree. The training regime is much more creative; it will be fun to see.