Some trends are reasonable to extrapolate; some are not. And even for the trends we do manage to extrapolate, it is not clear how that signal translates into different AI behaviors.
When using the lens of "reliability" to think about LLM capabilities on long-form tasks (reasoning chains or agentic workflows)... if a given model has a per-step error rate of (say) 1%, I can think of at least three possible explanations. I'm wondering whether you have any intuition as to which is more likely to be correct (or what the likely mix is).
A) Random noise: the model is often uncertain about what to output next, and sometimes gets a bad die roll.
B) Variance in subtask difficulty: 99% of the steps are relatively easy, 1% are difficult (require a creative leap / obscure knowledge / tricky reasoning) and the model isn't smart enough. (Part of the idea here is that there is a diversity of difficult tasks, and so the specific capabilities needed to accomplish them are diffuse.)
C) Correlated failures: 1% of the steps require some particular capability that the model does not possess. (Basically the same as B, except that the model keeps tripping over the same shortcoming instead of a variety of different shortcomings.)
This interests me because it seems to bear on the question of how capabilities will evolve. Naively, I would expect that (A) could be addressed by increasing either training or test compute. (Increased training == more certain answers, increased test == can produce N answers and choose the best.) (B) seems best addressed through increased training scale. (C) breaks down into a variety of scenarios; for a given category of task, the issue could be something that is easily addressed (e.g. through fine-tuning), something that is likely to be knocked off by another round of training scale... or a deep issue that simple scaling won't address.
(I think of o1 as showing that for a certain category of chain-of-thought task, models in the gpt-4 class were suffering from a category-C problem, and the additional training given to o1 addressed that problem. But I don't have much confidence in my understanding here.)
This is a different, maybe more specific, way to describe what I was thinking.
I don't really want to attribute any of it to random noise, since those are technically the training-related failure types you mention below, but I suppose that with temperature > 0 the sampling is genuinely random.
At the end of the day, the math squishes them all together. You need mechanistic interpretability to disambiguate, yes?
Your point about o1 is something that's been on my mind, and I agree. You can get away with "less model" and avoid a previously common type of error.
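To make the A/B/C distinction concrete, here's a minimal Monte Carlo sketch (made-up numbers: 100-step tasks with a 1% average per-step error rate, not measurements of any real model) contrasting pure random noise with deterministic "hard step" failures, and showing what best-of-N retries buy you in each case:

```python
import random

rng = random.Random(0)

STEPS = 100      # steps per long-form task
TASKS = 5_000    # simulated tasks per scenario
STEP_ERR = 0.01  # average per-step error rate in every scenario
RETRIES = 5      # independent end-to-end attempts per task (best-of-N)

def attempt(mode, hard_steps):
    """One end-to-end attempt at a task; True iff every step succeeds."""
    for i in range(STEPS):
        if mode == "A":
            # (A) Random noise: each step independently fails 1% of the time,
            # so a fresh attempt re-rolls every die.
            if rng.random() < STEP_ERR:
                return False
        else:
            # (B)/(C): steps marked "hard" fail every time, and retrying the
            # same task doesn't help. B and C differ only in *why* a step is
            # hard, which this aggregate view can't see.
            if i in hard_steps:
                return False
    return True

def success_rate(mode, retries):
    wins = 0
    for _ in range(TASKS):
        # Which steps are hard is a fixed property of the task (used by B/C,
        # ignored by A, where every failure is a fresh coin flip).
        hard_steps = {i for i in range(STEPS) if rng.random() < STEP_ERR}
        if any(attempt(mode, hard_steps) for _ in range(retries)):
            wins += 1
    return wins / TASKS

for mode in ("A", "B/C"):
    print(f"{mode}: single attempt {success_rate(mode, 1):.2f}, "
          f"best-of-{RETRIES} {success_rate(mode, RETRIES):.2f}")
```

In this toy setup every scenario has the same single-attempt success rate (roughly 0.99^100 ≈ 0.37), but retries recover most of the noise-driven failures and essentially none of the capability-driven ones, which is one behavioral way to probe the mix without opening the model up.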
Thanks for shedding light on this.
> decreases with quadratic compute increase
If you're scaling Chinchilla-optimally, shouldn't this be quartic rather than quadratic? Since L(N, D) varies roughly as N^-0.5 and D^-0.5, with C = 6ND.
Yes, quartic-ish! I was mostly thinking of the shape and didn't have the right word in my head; I had bounced between a couple of terms.
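For concreteness, here's the arithmetic under the approximation quoted above (reducible loss roughly A*N^-0.5 + B*D^-0.5, with C = 6ND and N, D scaled together): both N and D grow as C^0.5, so the reducible loss falls as C^-0.25, meaning a halving costs about 16x compute. A quick numeric sketch (the constants are arbitrary placeholders, and the ~20 tokens-per-parameter constant is ignored since only the exponents matter for the ratio):

```python
# Toy check of the "quartic-ish" claim, using the exponents quoted above:
# reducible loss ~ A*N**-0.5 + B*D**-0.5, with C = 6*N*D and N, D scaled
# together. A and B are arbitrary placeholders; only the scaling matters.

def reducible_loss(compute, a=1.0, b=1.0):
    n = d = (compute / 6) ** 0.5   # compute-optimal: N and D each grow as C^0.5
    return a * n ** -0.5 + b * d ** -0.5

base = 1e21
for mult in (4, 16, 256):
    ratio = reducible_loss(base * mult) / reducible_loss(base)
    print(f"{mult:>3}x compute -> reducible loss x {ratio:.3f}")
# 16x compute halves the reducible loss (16**-0.25 == 0.5), i.e. loss ~ C^-0.25:
# quartic in compute rather than quadratic.
```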