Discussion about this post

James Wang:

An honest question on how this works, because there's a lot of broad speculation, and I've gotten confused as to what this stuff is or isn't. There's been a lot of loose language and redefinition of previous terminology that carries other connotations, if not hard definitions (AGI, what it is or isn't now… and "reasoning" here).

Some of OpenAI’s rather… convenient… marketing language (understandable given their business priorities) hasn’t helped either.

At the same time, this is supposedly a "single forward pass of a model that is trained with RL," but it chooses the most common response with something like a consensus@N method… yet it doesn't have an evaluator model? And results are often replicable with CoT from repeated prompting of a "non-reasoning" model?
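
Just to pin down what I mean by consensus@N, the picture in my head is roughly the sketch below. It's purely illustrative: sample_fn stands in for whatever sampling call is actually used, the "final answer" parsing is a crude placeholder, and I'm not claiming this is what OpenAI actually runs. The point is just that a majority vote needs no separate evaluator model.

```python
from collections import Counter
from typing import Callable

def consensus_at_n(sample_fn: Callable[[str], str], prompt: str, n: int = 16) -> str:
    """Sample n independent completions and return the most frequent final answer.

    sample_fn is a stand-in for a real sampling API (temperature > 0 so the
    samples differ). There is no evaluator model here, only a frequency count.
    """
    answers = []
    for _ in range(n):
        completion = sample_fn(prompt)
        # Crude placeholder for "extract the final answer": take the last line.
        answers.append((completion.strip().splitlines() or [""])[-1])
    # Majority vote over the parsed answers.
    return Counter(answers).most_common(1)[0][0]
```

If something like this is going on, the inference cost scaling linearly with the number of samples would fall out directly.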

What is the actual nature of the reasoning here? I can understand the conceptual idea (whether or not it's actually implemented this way) of "run it a bunch of times, with extra internal prompting, to get more consistent and go further." That would also explain, conceptually, why the inference costs and time scale the way they do.

But then if it's just a forward pass of a plain old language model, are we saying that it generates the tokens in that same way and hides things until the final output to the user? That would also fit the conceptual model and would explain why some of these repeated-prompting cases have replicated o1 results.
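
Concretely, the version of that I can imagine looks like the sketch below: the reasoning tokens come out of the same decoding loop as any other tokens, and only the tail end is surfaced. The <reasoning>/<answer> tags and the prompt wording are made-up placeholders for illustration, not OpenAI's actual format.

```python
def answer_with_hidden_cot(sample_fn, question: str) -> str:
    """One generation pass that produces reasoning tokens plus an answer,
    then surfaces only the answer to the user.

    sample_fn is again a stand-in for a real sampling call; the tag scheme
    is an assumption made purely for this sketch.
    """
    prompt = (
        "Think step by step inside <reasoning>...</reasoning>, "
        "then give only the result inside <answer>...</answer>.\n\n"
        f"Question: {question}"
    )
    completion = sample_fn(prompt)
    # The reasoning tokens are generated exactly like any other tokens;
    # they are simply stripped out before the user sees the output.
    start = completion.find("<answer>")
    end = completion.find("</answer>")
    if start == -1 or end == -1:
        return completion.strip()  # fall back to showing everything
    return completion[start + len("<answer>"):end].strip()
```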

Or is this a completely wrong understanding of what this is?
