Thanks for the overview! I especially appreciated the reflection at the end. I think this provides more value to your average reader than a straightforward roundup post; increasingly the models can do that for us lol
The post-benchmark thing resonates. I ran an experiment last week where I gave 4 Opus 4.6 agents the same task with zero coordination, then had a 5th agent synthesize the results. No benchmark could have predicted what happened — the agents developed completely different approaches, and the synthesis was better than any individual output.
Same pattern I found when stress-testing multi-agent workflows (wrote it up: https://thoughts.jock.pl/p/opus-4-6-agent-experiment-2026) — the gap between 'benchmark performance' and 'real-world emergent behavior' is widening, not shrinking.
Hot take: We'll stop comparing models within a year. The interesting comparison will be model + orchestration layer + tool access. Raw model capability is becoming table stakes.
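The fan-out/synthesize experiment described above can be sketched roughly like this. This is a hypothetical illustration, not the commenter's actual harness: `run_agent` is a stand-in for whatever client calls the model, and the seed/prompt details are made up.

```python
# Sketch of the "N independent agents + 1 synthesizer" pattern.
# run_agent is a placeholder; in practice it would call Opus 4.6
# (or any model) through your agent harness.
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str, seed: int) -> str:
    # Placeholder implementation so the sketch runs standalone.
    return f"approach-{seed}: {task}"

def fan_out_synthesize(task: str, n_workers: int = 4) -> str:
    # Fan out: the same task goes to n independent agents, zero coordination.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        drafts = list(pool.map(lambda s: run_agent(task, s), range(n_workers)))
    # Fan in: a final agent sees every draft and produces one synthesis.
    synthesis_prompt = "Synthesize the best parts of:\n" + "\n---\n".join(drafts)
    return run_agent(synthesis_prompt, seed=n_workers)
```

The point of the pattern is that the synthesizer gets to pick across genuinely divergent approaches, which is exactly the behavior no single-model benchmark measures.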
"Codex 5.3 feels far more different than its predecessors."
Do you think (or at least suspect) that this difference is because it's the first purely Blackwell-trained model on the market right now?
I don't really think so. If anything, GPT 5.2 was maybe that -- it felt way different, which could've meant a new pretraining to effectively serve on Blackwell, as we know that labs choose model sizes based on inference characteristics.
A lot of the "Codex built Codex 5.3" marketing struck me as smoke and mirrors.
Training FLOPs and infrastructure will show up as an org-level advantage over time.
Good post
That’s so true: the only way to truly measure the capabilities of this year's models is to just test them and try out their different features.
Amazing breakdown Nathan! I need to play around with Codex eventually!
For me, Codex 5.3 works pretty well. What I really dislike is Claude’s tendency to do more than I ever asked for; instruction following remains Claude’s biggest weakness.
I’ve been using OpenCode, and without Hooks, it always seems to forget its instructions after a short while. GPT/Codex is far better at following instructions over a long period. You described prompts like "create a branch and push it," but my workflow is very different; I would never write a prompt like that. I plan on using "Superpowers", which I’ve customized to use "beads."
Using this method, I’ve managed to build out very rich features (epics) on a relatively large codebase of around 1 million LOC. GPT 5.3 Codex and GPT 5.2 are significantly better at following AGENTS.md and handling skills and beads tasks. A huge plus is that I can always start a fresh conversation because the context doesn’t need to live in a single session or chat; it’s a massive improvement. I do agree that GPT 5.2 Codex and earlier versions weren't great and were definitely weaker than the current 5.3.
IMO Claude is like Apple: best in its first-party app and not prioritizing partners.
Another thing I forgot to mention: Claude always seems to consume and burn through more tokens. GPT is much more efficient at managing the context window itself.
What is your usual smattering of tasks that you used these latest models for?
Opus is way more expensive. It’s never been worth it for me. Even Trae editor removed it from their model list. They’re falling way behind on pricing. Sure, it used to be “the best” model for coding, but the problem is the price.
If you use it via Claude Code, it's way cheaper than using it with Cursor, for example.
Thanks for sharing your experiences working with these models. As someone doing a lot of independent AI in science evals, it is reassuring to understand what it is like to be pushing models at the edge of their abilities and having to keep up with change every few months.
"Post-benchmark" is the right framing: tool-fit > leaderboard. The usability angle is what most comparisons miss.
No benchmarks means frontier models could plateau for years and investors and users won't be able to track anything.
I am constantly reminded of the famous "no moats" memo that circulated inside Google back in 2023. I think any transient advantage one lab has will soon be competed away. It's not clear to me how anyone actually makes money on this.
That doesn’t seem like the conclusion at all? These services are immensely valuable and they’re making a lot of money. I’d pay more than I do if I had to.
My point is that these models are becoming progressively more undifferentiated. If someone offered a competitive model at a lower cost, you would likely switch to that quickly. Unlike, say, a social network, these models are not very sticky (except for the inertia of making the change).
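That low switching cost can be sketched concretely, assuming the common pattern where providers expose near-identical chat-completion request shapes. Everything below is illustrative: the provider names, endpoint URLs, and model names are all made up.

```python
# Sketch of why model APIs aren't very sticky: when request shapes match,
# "migrating" is often just a config swap. All values here are placeholders.
PROVIDERS = {
    "provider_a": {"base_url": "https://api.provider-a.example/v1", "model": "model-a"},
    "provider_b": {"base_url": "https://api.provider-b.example/v1", "model": "model-b"},
}

def make_request(provider: str, prompt: str) -> dict:
    # Same request body either way; only base_url and model name change.
    cfg = PROVIDERS[provider]
    return {
        "url": cfg["base_url"] + "/chat/completions",
        "json": {
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
    }
```

Prompts, tool schemas, and eval harnesses carry over unchanged; only the endpoint and model identifier differ, which is the inertia the comment above refers to.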
For many use cases, changing models is actually a huge pain. Most companies building with APIs find something that works, then never want to touch it.
I've been feeling this exact shift lately. The benchmark numbers for these new models have basically become noise at this point. I don't care if Codex 5.3 scores a few points higher on some synthetic eval if I still have to manually double-check every git command or file move it makes.
It really comes down to the developer experience. Anthropic seems to understand that a model being a "good coworker" is more valuable than it being a "genius that breaks the build." I was looking at the breakdown for Opus 4.6 https://automatio.ai/id/models/claude-opus-4-6 and it’s pretty clear they’re leaning into that agentic flow where it just handles the mundane stuff without needing a 500-word prompt to stay on track. OpenAI is still great for hunting down a specific obscure bug, but for actually shipping features without the constant babysitting, Claude is still the one I’d trust more for general dev work. The gap in product feel is way bigger than the gap in raw logic right now.
Do you have any thoughts on what kind of stack makes it easier to do agentic work, and how the stack should be modified to be more agent friendly?
Mostly it’s hard to always be thinking about what your agents are doing. It’s a mental tax. I need to find ways to get them to be more independent, but I don’t have much to add over my previous post https://www.interconnects.ai/p/get-good-at-agents