4 Comments
Steve Newman

Thanks for shedding light on this.

When using the lens of "reliability" to think about LLM capabilities on long-form tasks (reasoning chains or agentic workflows)... if a given model has an error rate of (say) 1%, I can think of at least three possible explanations. I'm wondering whether you have any intuition as to which is more likely to be correct (or what the likely mix is).

A) Random noise: the model is often uncertain about what to output next, and sometimes gets a bad die roll.

B) Variance in subtask difficulty: 99% of the steps are relatively easy, 1% are difficult (require a creative leap / obscure knowledge / tricky reasoning) and the model isn't smart enough. (Part of the idea here is that there is a diversity of difficult tasks, and so the specific capabilities needed to accomplish them are diffuse.)

C) Correlated failures: 1% of the steps require some particular capability that the model does not possess. (Basically the same as B, except that the model keeps tripping over the same shortcoming instead of a variety of different shortcomings.)

This interests me because it seems to bear on the question of how capabilities will evolve. Naively, I would expect that (A) could be addressed by increasing either training or test compute. (Increased training == more certain answers, increased test == can produce N answers and choose the best.) (B) seems best addressed through increased training scale. (C) breaks down into a variety of scenarios; for a given category of task, the issue could be something that is easily addressed (e.g. through fine-tuning), something that is likely to be knocked off by another round of training scale... or a deep issue that simple scaling won't address.
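
(To make the A-vs-C distinction concrete, here is a toy simulation, not anything measured: the 1% rate, 200-step task, and 8 retries are made-up numbers. The point is that under A, best-of-N retries recover most failures, while under C they recover none, because the model keeps hitting the same missing capability.)

```python
import random

STEPS = 200      # steps per long-form task (illustrative)
P_ERR = 0.01     # per-step error / "hard step" probability (illustrative)
K = 8            # attempts per task, best-of-N style
TASKS = 10_000   # Monte Carlo sample size

def pass_at_k_random_noise() -> float:
    """Scenario A: every step of every attempt is an independent die roll,
    so re-running the whole task K times genuinely helps."""
    solved = 0
    for _ in range(TASKS):
        if any(all(random.random() > P_ERR for _ in range(STEPS)) for _ in range(K)):
            solved += 1
    return solved / TASKS

def pass_at_k_correlated() -> float:
    """Scenario C: each task contains a fixed set of steps the model simply
    cannot do, so resampling the same task never changes the outcome."""
    solved = 0
    for _ in range(TASKS):
        has_hard_step = any(random.random() < P_ERR for _ in range(STEPS))
        if not has_hard_step:
            solved += 1
    return solved / TASKS

if __name__ == "__main__":
    # pass@1 is about 0.99**200 ≈ 0.13 in both scenarios; the gap only appears with retries.
    print(f"Scenario A, pass@{K}: {pass_at_k_random_noise():.2f}")  # roughly 0.68
    print(f"Scenario C, pass@{K}: {pass_at_k_correlated():.2f}")    # roughly 0.13
```

(How B behaves under retries depends on whether the model occasionally gets the hard step right by luck, so it sits somewhere between the two.)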

(I think of o1 as showing that for a certain category of chain-of-thought task, models in the gpt-4 class were suffering from a category-C problem, and the additional training given to o1 addressed that problem. But I don't have much confidence in my understanding here.)

Nathan Lambert

This is a different, maybe more specific, way of describing what I was thinking.

I don't really want to attribute any of it to random noise, since those cases are arguably just instances of the other failure types you list (related to training), but I suppose with temperature > 0 sampling there is a truly random component.

At the end of the day, the math squishes them all together. You need mechanistic interpretability to disambiguate, yes?

Your point about o1 is something that's been on my mind, and I agree. You can get away with "less model" and avoid a previously common type of error.

JS Denain

> decreases with quadratic compute increase

If you're scaling Chinchilla-optimally, shouldn't this be quartic rather than quadratic? Since L(N, D) varies as about N^-0.5 and D^-0.5, with C = 6ND.
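
(Spelling out the arithmetic, rounding the fitted exponents to 0.5 as above, so this is only approximate:)

L(N, D) − E ≈ A·N^-0.5 + B·D^-0.5
With C = 6ND split compute-optimally, N ∝ C^0.5 and D ∝ C^0.5,
so the reducible loss scales as C^-0.25: halving it takes about 2^4 = 16x the compute, i.e. quartic.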

Nathan Lambert

Yes, quartic-ish! I was mostly thinking about the shape of the curve and didn't have the right word in my head; I had bounced between a couple of options.
