Discussion about this post

Steve Newman:

Thanks for shedding light on this.

When using the lens of "reliability" to think about LLM capabilities on long-form tasks (reasoning chains or agentic workflows)... if a given model has an error rate of (say) 1%, I can think of at least three possible explanations. I'm wondering whether you have any intuition as to which is more likely to be correct (or what the likely mix is).

A) Random noise: the model is often uncertain about what to output next, and sometimes gets a bad die roll.

B) Variance in subtask difficulty: 99% of the steps are relatively easy, 1% are difficult (requiring a creative leap, obscure knowledge, or tricky reasoning), and the model isn't smart enough for those. (Part of the idea here is that there is a diversity of difficult tasks, so the specific capabilities needed to accomplish them are diffuse.)

C) Correlated failures: 1% of the steps require some particular capability that the model does not possess. (Basically the same as B, except that the model keeps tripping over the same shortcoming instead of a variety of different shortcomings.)

This interests me because it seems to bear on the question of how capabilities will evolve. Naively, I would expect that (A) could be addressed by increasing either training or test compute. (Increased training == more certain answers, increased test == can produce N answers and choose the best.) (B) seems best addressed through increased training scale. (C) breaks down into a variety of scenarios; for a given category of task, the issue could be something that is easily addressed (e.g. through fine-tuning), something that is likely to be knocked off by another round of training scale... or a deep issue that simple scaling won't address.
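
A minimal back-of-the-envelope sketch of how these explanations diverge under best-of-K resampling (i.e. spending more test-time compute), assuming a 100-step task, independent samples at each step, and an oracle verifier that recognizes a correct sample. The parameter values and the `task_success` helper are illustrative, not anything from the post.

```python
N_STEPS = 100   # length of the reasoning chain / agentic workflow
P_HARD = 0.01   # fraction of steps that are error-prone (the "1%")

def task_success(p_fail_easy, p_fail_hard, k):
    """Probability of completing all N_STEPS when each step is 'hard'
    with probability P_HARD and a step succeeds if any of k independent
    samples is correct (oracle verifier)."""
    step_fail = P_HARD * p_fail_hard**k + (1 - P_HARD) * p_fail_easy**k
    return (1 - step_fail) ** N_STEPS

# A) Pure noise: every step fails with prob 0.01 on any given sample.
print("A, k=1:", round(task_success(0.01, 0.01, 1), 2))   # ~0.37
print("A, k=5:", round(task_success(0.01, 0.01, 5), 2))   # ~1.00

# B) Diverse hard steps the model sometimes gets right on a retry
#    (80% failure per attempt on hard steps; easy steps never fail).
print("B, k=1:", round(task_success(0.0, 0.8, 1), 2))     # ~0.45
print("B, k=5:", round(task_success(0.0, 0.8, 5), 2))     # ~0.72

# C) Missing capability: hard steps fail regardless of resampling.
print("C, k=5:", round(task_success(0.0, 1.0, 5), 2))     # ~0.37
```

Under these assumptions, resampling essentially eliminates (A), partially mitigates (B), and does nothing for (C), which matches the intuition that only (A) is cheaply bought with test-time compute.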

(I think of o1 as showing that for a certain category of chain-of-thought task, models in the gpt-4 class were suffering from a category-C problem, and the additional training given to o1 addressed that problem. But I don't have much confidence in my understanding here.)

JS Denain:

> decreases with quadratic compute increase

If you're scaling Chinchilla-optimally, shouldn't this be quartic rather than quadratic, since L(N, D) scales roughly as N^-0.5 and D^-0.5, with C = 6ND?
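
A quick numeric check of this point, taking the quoted approximation at face value: reducible loss ≈ A·N^-0.5 + B·D^-0.5 with C = 6ND, splitting compute evenly (N = D = sqrt(C/6), which is optimal when the two exponents and coefficients match). The coefficients below are illustrative placeholders, not fitted values.

```python
import math

A, B = 1.0, 1.0   # illustrative coefficients for the reducible-loss terms

def reducible_loss(C):
    """Compute-optimal split under the quoted scaling: C = 6*N*D with
    equal exponents, so N = D = sqrt(C / 6)."""
    N = D = math.sqrt(C / 6)
    return A * N**-0.5 + B * D**-0.5

base = 1e21
for factor in (1, 2, 4, 16):
    print(f"{factor:>3}x compute -> {reducible_loss(base * factor):.3e}")
# The reducible loss scales as C**-0.25: 16x compute halves it (quartic),
# while a 2x increase only shaves off ~16%.
```

So with these exponents the reducible loss falls as C^-0.25, which supports the quartic reading.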
