The weaknesses of today's best models are far from those of the original ChatGPT: we see they lack speed, we fear superhuman persuasion, and we aspire for them to be more autonomous. Today's models are all reasoning models that have long surpassed the original weaknesses of ChatGPT-era language models: hallucinations, a total lack of recent information, complete capitulations, and other hiccups that looked like minor forms of delusion laid on top of an obviously spectacular new technology.
Reasoning models today are far more complex than the original chatbots, which consisted of standalone model weights plus some lightweight scaffolding such as safety filters. They're built on three primitives that'll be around for years to come (a minimal sketch of how they fit together follows the list):
Thinking: The reasoning traces that enabled inference-time scaling. The "thoughts" of a reasoning model take a very different form from the human thinking that inspired terminology like Chain of Thought (CoT) or Thinking models.
Searching: The ability to request additional, specific information from non-parametric knowledge stores designed specifically for the model. This fills the gap created by model weights being static while the world they live in is dynamic.
Acting: The ability for models to manipulate the physical or digital world. Everything from code execution today to real robotics in the future allows language models to contact reality and overcome their nondeterministic core. Most of these executable environments are going to be built on top of infrastructure for coding agents.
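To make the three primitives concrete, here is a minimal sketch of how they fit into a single agent loop. The client interface and the tool names (web_search, run_python) are illustrative assumptions for this post, not any particular lab's API.

```python
# Minimal sketch of a reasoning-model agent loop; interfaces here are assumed, not real APIs.
def agent_loop(model, tools, prompt, max_steps=10):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        # Thinking: the model emits a reasoning trace, then either answers or calls a tool.
        step = model.generate(messages)  # hypothetical client returning .text / .tool_call
        if step.tool_call is None:
            return step.text  # final answer
        name, args = step.tool_call.name, step.tool_call.arguments
        if name == "web_search":
            # Searching: pull fresh, specific context that the static weights lack.
            result = tools["web_search"](args["query"])
        elif name == "run_python":
            # Acting: execute code so the model's output contacts reality.
            result = tools["run_python"](args["code"])
        else:
            result = f"unknown tool: {name}"
        messages.append({"role": "tool", "name": name, "content": result})
    return "stopped after max_steps without a final answer"
```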
These reasoning language models, as a form of technology, are going to last far longer than the static model weights that predated and birthed ChatGPT. Just over a year out from the release of OpenAI's o1-preview on September 12, 2024, the magnitude of this shift is worth writing in ink. Early reasoning models with astounding evaluation scores were greeted with the criticism that "they won't generalize," but that has turned out to be resoundingly false.
In fact, with OpenAI's o3, it only took 3-6 months for these primitives to converge! Still, it took the AI industry more broadly longer to converge on this. The most similar follow-up on the search front was xAI's Grok 4, while some frontier models, such as Claude 4, express their reasoning-model nature in a far more nuanced manner. OpenAI's o3 (and GPT-5 Thinking, a.k.a. Research Goblin) and xAI's Grok 4 seem like dogs determined to chase their goal indefinitely, burning substantial compute along the way. Claude 4 has a much softer touch, resulting in a model that is a bit less adept at search but almost always returns a faster answer. The long reasoning traces and tool use can be crafted to fit different profiles, giving us a spectrum of reasoning models.
The taxonomy I laid out this summer for next-generation reasoning models, covering skills for reasoning intelligence, calibration to not overthink, strategy to choose the right solutions, and abstraction to break problems down, describes the traits that'll make a model most functional given this new perspective and an agentic world.
The manner of these changes is easy to miss. For one, consider hallucinations, an obvious weakness downstream of the stochastic inference innate to the models and their fixed knowledge cutoff. With search, hallucinations now look like missing context rather than blatantly incorrect content. Language models are nearly perfect at copying content and similarly solid at referencing it, but they're still very flawed at long-context understanding. Hallucinations still matter, but it's a very different chapter of the story and will be studied differently depending on whether the subject is reasoning or non-reasoning language models.
Non-reasoning models still have a crucial part to play in the AI economy due to their efficiency and simplicity. In a sense they are a subset of a reasoning model, because you can always use the weights without tools, and they'll be used extensively to undergird the digital economy. At the same time, the frontier AI models (and systems) of the coming years will all be reasoning models as presented above: thinking, searching, and acting.1
Language models will get access to more tools of some form, but all of them will be subsets of code or search. In fact, search can be argued to be a form of execution itself, but given how central the underlying information is, it is best left as its own category.
Another popular discussion around the extremely long generations of reasoning models has been the idea that more efficient architectures, such as diffusion language models, could come to dominate by generating all the tokens in parallel. The (or rather, one) problem here is that they cannot easily integrate tools, such as search or execution, in the same way. These will likely also be valuable options in the AI quiver, but barring a true architectural or algorithmic revolution that multiplies the performance of today's AI models, the efficiency and co-design underway for large transformers will enable the most dynamic reasoning models.
Establishing what makes a reasoning model complete brings an important mental transition in what it takes to make a good model. Now, the quality of the tools a model is embedded with is arguably more straightforward to improve than the model itself (it just takes substantial engineering effort), though this is far harder for open models. The AI "modeling" itself is mostly open-ended research.
Closed models have the benefit of controlling the entire user experience with the stack, whereas open models need to be designed so that anyone can take the weights off of Hugging Face and easily get a great experience deploying them with open-source libraries like vLLM or SGLang. When it comes to tools used during inference, this means the models can have a recommended setting that works best, but they may take time to support meaningful generalization to new tools.
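As a concrete illustration of that "weights off of Hugging Face" workflow, here is a minimal sketch using vLLM's offline Python API; the checkpoint name and sampling settings below are placeholders, not recommendations.

```python
# Minimal sketch: serving an open model pulled from the Hugging Face Hub with vLLM.
# The model name is a placeholder, not a real or recommended checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/your-open-reasoning-model")
params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Think step by step: why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```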
For example, OpenAI can train and serve their models with only one search engine, whereas I at Ai2 will likely train with one search engine and then release the model into a competitive space of many search products. A place where this can benefit open models is something like MCP, where open models are developed natively for a world in which we cannot know all the uses of our models, making MCP libraries a great candidate for testing. Of course, leading AI laboratories will do this too (or have already started), but it will sit differently in their priority stack.
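To sketch what that looks like in practice, here is a minimal search tool exposed over MCP with the official Python SDK (the mcp package); the retrieval backend is a stand-in, and the point is that a model trained against one search engine can call whatever index sits behind the same tool interface.

```python
# Minimal sketch of a search tool exposed over MCP using the official Python SDK.
# The retrieval logic is a placeholder; any MCP-aware client or model can call it.
from mcp.server.fastmcp import FastMCP

server = FastMCP("open-search")

@server.tool()
def search(query: str) -> str:
    """Return result snippets for a query from whichever index is plugged in."""
    # Placeholder backend: swap in a real search index or API here.
    return f"Top results for: {query}"

if __name__ == "__main__":
    server.run()
```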
Much has been said about tokenomics and the costs associated with reasoning models without taking the tool component into account. There was a very popular article articulating how models are only getting more expensive, with a particular focus on reasoning models using far more tokens. This overstates a blip: a point in time when serving costs increased by 1000x because models generated vastly more tokens without improved hardware to serve them.
The change in cost of reasoning models reflected a one-time step up in most circumstances, as the field collectively turned on inference-time scaling with the same reasoning techniques. At the same time as the reasoning model explosion, the parameter count of models reaching users has all but stagnated. This reflects diminishing returns in quality from scaling parameters; it's why OpenAI said GPT-4.5 wasn't a frontier model and why Gemini never released their Ultra model class. The same will come for reasoning tokens.
While diminishing returns are hitting the number of reasoning tokens per serial stream, we're finally seeing large clusters of Nvidia's Blackwell GPUs come online. The costs for models seem well on the path to level out and then decrease as the industry develops more efficient inference systems; the technology industry is phenomenal at making widely used products far cheaper year over year. The costs that'll go up are those of the agents enabled by these reasoning models, especially with parallel inference, such as the Claude Code clones or OpenAI's rumored Pro products.
What we all need is a SemiAnalysis article explaining how distorted standard tokenomics are for inference with tools and whether tools substantially increase variance across implementations. People focus too much on the higher token costs from big models with long context lengths (those are easy to fix with better GPUs) while overlooking many other costs, such as search indices or idle GPU time spent waiting for tool execution results.
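A back-of-envelope sketch of what that decomposition could look like for a single agentic query; every number below is an assumption chosen for illustration, not a measurement.

```python
# Illustrative per-query cost decomposition for a tool-using agent.
# All prices, token counts, and latencies below are assumptions, not data.
output_tokens = 20_000              # long reasoning trace across the whole task
price_per_m_output_tokens = 10.0    # $ per million output tokens (assumed)
token_cost = output_tokens / 1e6 * price_per_m_output_tokens

search_calls = 15
price_per_search = 0.005            # $ per search index / API query (assumed)
search_cost = search_calls * price_per_search

tool_wait_seconds = 45              # time spent blocked on tool execution results
gpu_hour_cost = 3.0                 # $ per GPU-hour of capacity held open (assumed)
idle_cost = tool_wait_seconds / 3600 * gpu_hour_cost

print(f"tokens ${token_cost:.3f} | search ${search_cost:.3f} | idle ${idle_cost:.3f}")
# Better GPUs shrink the token line, but the search and idle lines don't follow FLOPs.
```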
When we look at a modern reasoning model, it is easy to fixate on the thinking tokens that give the models their name. At the same time, search and execution are such fundamental primitives for modern language models that they can rightfully stand on their own as pillars of modern AI. These are AI systems that depend far more on the quality of a complex inference stack than on getting the right YOLO run for the world's best model weights.
The reason thinking, searching, and acting are all looped together under the label "reasoning model" is that inference-time scaling with meandering chains of thought was the technological innovation that made both search and execution far more functional. Reasoning was the step-change event that set these three as technology standards.
The industry is in its early days of building out the fundamental infrastructure to enable these primitives, which manifests as the early days of language model agents. That infrastructure pairs deterministic computing and search with the beauty, power, and flexibility of the probabilistic models we fell in love with via ChatGPT. This reasoning model layer is shaping up to be the infrastructure that underpins the greatest successes of the future technology industry.
1. Barring an unpredictable scientific breakthrough in architecture.