17 Comments
Nasheed Yasin

Here's something that I've come to realise: a year of fine-tuning, say, GPT-5 for your custom task can be undone by the release of GPT-6, which can just best you zero-shot. However, training search and rec-sys models will always be based on your specific use case. The way I answer "what's relevant" will be different from the way you answer it for a different problem.

I think it's best for everyone not at a frontier lab to get really, really good at training these vector models (or, more generally, contrastively trained models), from an existential POV.

Nathan Lambert

Yup, my plan is to train them and share how.

Dominic Caldwell

Nasheed: a slight alternative idea, perhaps, is that we need to use every model to build the deep meta-prompts that will allow us to best engage the next model. That way we're not losing anything; we're building the context and specifics that allow us to have and build unique artifacts.

Thomas Bustos

Meta-prompts and evolving context in apps have higher defensibility.

Michael Pieler

> training search and rec sys models will always be based on your specific use case.

I have been thinking about this a lot.

I guess similarity search is something that is not yet easily doable with text LLMs.

However, this is maybe something that can be done in the future?

Happy to discuss this in more detail!

Nasheed Yasin

Quite the contrary: text generation models do really well at similarity search. The LLM2Vec recipe has become mainstream over the last year or so.

Further, the concept of instruction-following embedders has also become somewhat mainstream. You can basically explain what "similar" means in the prompt.
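
To make that concrete, here's a minimal sketch of the pattern (the model name, instruction strings, and mean pooling are illustrative placeholders, not a specific recipe):

```python
# Sketch: instruction-following embeddings with a (repurposed) text LLM backbone.
# The model name, instructions, and pooling choice are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL = "your-favourite-embedding-backbone"  # placeholder, not a real checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:              # decoder-only tokenizers often lack one
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL)

def embed(texts, instruction):
    # Prepend the task description so "similar" is defined in the prompt itself.
    batch = tokenizer([f"{instruction} {t}" for t in texts],
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)          # zero out padding positions
    pooled = (hidden * mask).sum(1) / mask.sum(1)         # mean pooling
    return F.normalize(pooled, dim=-1)

query = embed(["wireless earbuds for running"],
              instruction="Represent this product search query for retrieval:")
docs = embed(["sweat-proof sport earbuds", "noise-cancelling studio headphones"],
             instruction="Represent this product description for retrieval:")
print(docs @ query.T)  # cosine similarities; "similar" is whatever the instruction says
```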

However, research from DeepMind highlights that there are limitations to training similarity embedding models the way we've been doing it. Here's a link to the paper: https://arxiv.org/abs/2508.21038.

And I have some work on RL-tuning embedding models after contrastive learning to better align them with what counts as similar in your use case. It's unpublished, but the idea involves using the perplexity of a downstream next-token-prediction model on a gold answer as the reward signal. @Nathan, would love to get your views on this bit.
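
Roughly, the shape of it is something like the sketch below (the sampled-retrieval policy, the REINFORCE update, and names like `embedder` and `frozen_lm` are my hypothetical stand-ins, not the actual unpublished method):

```python
# Sketch: RL-tuning a contrastively pre-trained embedder with a perplexity-based reward.
# `embedder`, `frozen_lm`, `tokenizer`, and the corpus tensors are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def retrieval_log_probs(query_emb, corpus_embs, temperature=0.05):
    # Treat retrieval as a stochastic policy: softmax over similarity scores.
    scores = query_emb @ corpus_embs.T / temperature      # (1, num_docs)
    return F.log_softmax(scores, dim=-1)

def reward_from_gold_answer(frozen_lm, tokenizer, query, passage, gold_answer):
    # Reward = negative mean NLL of the gold answer given (retrieved passage + query),
    # i.e. lower perplexity on the gold answer -> higher reward.
    prompt = f"Context: {passage}\nQuestion: {query}\nAnswer:"
    ids = tokenizer(prompt + " " + gold_answer, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]  # approx. boundary
    with torch.no_grad():
        logits = frozen_lm(ids).logits[:, :-1]            # predict token t+1 from token t
    targets = ids[:, 1:]
    nll = F.cross_entropy(logits[0, prompt_len - 1:], targets[0, prompt_len - 1:])
    return -nll.item()

def reinforce_step(embedder, frozen_lm, tokenizer, optimizer,
                   query, corpus_texts, corpus_embs, gold_answer, baseline=0.0):
    # corpus_embs are assumed precomputed and detached; only the query side gets gradients.
    q = embedder(query)                                   # differentiable query embedding, (1, dim)
    log_probs = retrieval_log_probs(q, corpus_embs)
    idx = torch.multinomial(log_probs.exp(), 1).item()    # sample which passage to retrieve
    r = reward_from_gold_answer(frozen_lm, tokenizer, query, corpus_texts[idx], gold_answer)
    loss = -(r - baseline) * log_probs[0, idx]            # REINFORCE on the embedder only
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return r
```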

Michael Pieler

Yes, you are completely right about the classic text-only similarity search pipeline with (repurposed) LLMs, i.e., text input --> embedding --> similarity score computed from the embedded query.

After rereading my sentence above

"I guess similarity search is something that is not yet easily doable with text LLMs."

I should have added more details, sorry, I typed that too quickly. I was thinking along the lines of a (text) LLM doing similarity search directly in a text/chat interface (without surfacing any embeddings to the user). But maybe this is better solved by a tool than by building such a capability into an LLM ... or maybe there is already work out there that I'm not aware of?

Paul Chen

Hi Nathan, great write-up! What do you think of the implications for data, given that we didn't naturally have action data in pre-training? The Tongyi-Research series of papers starts to add multi-step tool-use trajectories into pre-training. Would this save us from hitting the scaling-law wall on the data front?

Sri

It's almost like search for inputs, thinking for processing, and execution for outputs captures the whole range of what an LLM does. Love it.

The biggest alpha of this article is that there are so many opportunities to save money on the inference end. While I agree on paper, isn't the time horizon of tasks going to keep going up while subscription costs haven't changed this year?

Mia

If “thinking, searching, acting” are the new primitives, maybe weights aren’t the real differentiator anymore. Is the real race less about bigger models and more about the scaffolding we build around them?

Dominic Caldwell

A superb and clarifying post, Nathan. You've perfectly articulated the architectural shift from static models to the T/S/A (Thinking, Searching, Acting) paradigm that now defines the frontier. Your piece is the most lucid explanation I've seen of the engineering reality that has emerged post-o1.

The question that keeps me up at night, however, isn't about the primitives themselves, but about the ecosystems they will create. This leads me to think there's a missing fourth primitive, one that operates on a fourth axis: Translation. In theory, translation / interface / interoperability protocols allow incompatible reasoning models—each with its own proprietary T/S/A stack, its own values, its own "common sense"—to interact, disagree, and trade information without collapsing into a monoculture or destroying each other.

Translating is a "fuzzy" problem involving hermeneutics, legal reasoning, and negotiating meaning across different contexts. That would imply that part of the upcoming AI challenges involve using the humanities and life sciences to connect a thousand different T/S/A stacks.

Excellent work, as always.

JV
Sep 23 (edited)

Seems slightly redundant to list searching and acting separately. They are just special cases of tool use; acting would cover both.

That aside, I don't see why tool use should affect inference economics significantly. If I run Claude Code, all the tool use happens on my computer. I assume it's the same for e.g. ChatGPT: search and other tool use happen on some traditional server, not on the inference servers with the GPUs? Those just get the same kind of LLM calls as before.

Nathan Lambert

Without fresh information, LLMs can't do so many tasks.

When the CPUs are calling tools, expensive GPUs sit idle. This is a big deal.

JV
Sep 23 (edited)

Right, but the GPUs don't have to wait for the tool use any more than for any other LLM call. A GPU somewhere isn't waiting for me to reply to ChatGPT; it's handling other chats or whatever. Why would one wait for a tool call to finish?

Nathan Lambert

It's not easy to swap KV caches around mid-generation; it makes inference more complex! So far the impact has been minimal, but if the tools act in the real world it could get complicated. All I'm saying is I want more info.

Dominic Caldwell

Also, at a broader level, “acting” is itself one of the basic forms of intelligence because it tends to require new systems for sensing and distinguishing between self and other (“searching” and “self-awareness”). You can reject the idea of consciousness or self-awareness in these systems and still believe that “acting” is an extremely important step in the development of “intelligence.”

TheAISlop

Reasoning's roots in Q-learning, which became Strawberry 🍓🍓 and finally scaled into a transformer model (aka o1), have been a great step. But I'm not hearing anyone talk about a subsequent breakthrough of similar substance. It feels like early full self-driving: a fast batch of wins to get to OK results, then the rest becomes a long, slow slog for a decade.
