Glad to see people talking about the secret sauce behind o1 again! I feel like in the week after it was announced/launched there was a lot of that type of talk, but since then, my vague sense is that the discourse has shifted more towards how good the model is, rather than what's making it good.

I've been thinking a lot about this question of single autoregressive generation vs. search, so it's funny that you say your initial take was based on trying to take OpenAI at face value. I actually kind of reached the opposite conclusion at the time by trying to take them at face value--they explicitly said it was a model, not a system, and the handful of true CoT examples they showed in the announcement were each obviously a single token trajectory, not multiple paths in parallel. So I figured if it was actually doing search at test time and choosing the best paths with some kind of reward model, then they were deliberately obfuscating that. But then your early posts convinced me that actually, that was very possible!

I wrote a bit about those differing strategies and what they might look like (https://llmpromptu.substack.com/i/150696329/what-we-dont-know-about-o-model-or-system) a few weeks ago, and it's something I'm still thinking about all the time.

One thought I keep coming back to lately is that I don't see why you couldn't do search + reward at train time, and then wrangle the results of that into a single chain of thought that you then train the model on so that it can reproduce that in a single token trajectory at inference time. E.g., somewhere during training have the model generate n candidate answers, pick the best one with your reward model, and then from that construct chains of thought that start with one of the dispreferred answers and then "naturally" pivot to the preferred one. Then train the model on those chains of thought, so that it learns to produce chains where it pivots from worse to better answers in a single path.
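To make that concrete, here's a minimal sketch of the kind of loop I'm imagining. Everything in it (the candidate generator, the reward scorer, the pivot template) is a hypothetical stub for illustration, not anything OpenAI has described:

```python
# Illustrative sketch: search + reward at train time, distilled into a single
# chain of thought the model can reproduce in one pass at inference time.
# All function bodies are stand-in stubs, not a real training pipeline.

import random
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    reward: float


def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    # Stand-in for sampling n candidate answers from the policy model.
    return [f"candidate answer {i} to: {prompt}" for i in range(n)]


def score(prompt: str, answer: str) -> float:
    # Stand-in for a learned reward model / verifier.
    return random.random()


def build_pivot_chain(prompt: str, worse: str, better: str) -> str:
    # Stitch a single chain of thought that starts from a dispreferred answer
    # and "naturally" pivots to the preferred one.
    return (
        f"Question: {prompt}\n"
        f"First attempt: {worse}\n"
        "Hmm, that doesn't hold up on closer inspection.\n"
        f"Better approach: {better}\n"
        f"Final answer: {better}"
    )


def make_training_example(prompt: str, n: int = 4) -> str:
    cands = [Candidate(a, score(prompt, a)) for a in generate_candidates(prompt, n)]
    cands.sort(key=lambda c: c.reward, reverse=True)
    best, worst = cands[0], cands[-1]
    # This constructed chain is what you'd add to the fine-tuning dataset, so
    # the model learns to produce the worse-to-better pivot in a single path.
    return build_pivot_chain(prompt, worst.answer, best.answer)


if __name__ == "__main__":
    print(make_training_example("What is 17 * 24?"))
```

The appeal, to me, is that all the search and reward-model cost lives in the data-construction step; the trained model only ever sees (and reproduces) one linear trajectory.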

Anyways, all very interesting stuff. Thanks for linking the Sasha Rush video, will be checking that out!

I think my first post was overthinking the plot they shared. Plus, there were tweets from team members that very much nerd-sniped me. I'm cynical enough to know OpenAI could try to be clever in what they say versus what is actually happening.

I think you CAN do training-time search, but it actually makes things harder.
