22 Comments
David Dabney

Thanks for the overview! I especially appreciated the reflection at the end. I think this provides more value to your average reader than a straightforward roundup post; increasingly the models can do that for us lol

Pawel Jozefiak

The post-benchmark thing resonates. I ran an experiment last week where I gave 4 Opus 4.6 agents the same task with zero coordination, then had a 5th agent synthesize the results. No benchmark could have predicted what happened — the agents developed completely different approaches, and the synthesis was better than any individual output.

Same pattern I found when stress-testing multi-agent workflows (wrote it up: https://thoughts.jock.pl/p/opus-4-6-agent-experiment-2026) — the gap between 'benchmark performance' and 'real-world emergent behavior' is widening, not shrinking.

Hot take: We'll stop comparing models within a year. The interesting comparison will be model + orchestration layer + tool access. Raw model capability is becoming table stakes.
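The fan-out-then-synthesize setup Pawel describes (N independent agents, then one synthesizer) can be sketched roughly like this. This is a hedged illustration, not his actual harness: `run_agent` and `synthesize` are hypothetical placeholders for whatever model API calls the experiment used.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(agent_id: int, task: str) -> str:
    # Placeholder for a real model call. Each agent gets the same task
    # and no shared context (the "zero coordination" condition).
    return f"[agent {agent_id} solution to: {task}]"

def synthesize(task: str, drafts: list[str]) -> str:
    # Placeholder for a fifth agent that reads every draft and merges
    # them into one answer.
    merged = "\n---\n".join(drafts)
    return f"[synthesis over {len(drafts)} drafts]\nTask: {task}\n{merged}"

def fan_out_synthesize(task: str, n_agents: int = 4) -> str:
    # Fan out: run the agents in parallel, since they don't coordinate.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        drafts = list(pool.map(lambda i: run_agent(i, task), range(n_agents)))
    # Fan in: one synthesis pass over all independent drafts.
    return synthesize(task, drafts)
```

The interesting property is exactly the one the comment points at: the orchestration layer (parallel fan-out plus a synthesis pass), not the raw model, is what produces the better-than-any-individual output.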

Kevin Xu

"Codex 5.3 feels far more different than its predecessors."

Do you think (or at least suspect) that this difference is because it's the first purely Blackwell-trained model on the market right now?

Nathan Lambert

I don't really think so. If anything, GPT 5.2 was maybe that -- it felt way different, which could've meant a new pretraining run sized to serve efficiently on Blackwell, as we know that labs choose model sizes based on inference characteristics.

I thought a lot of the smoke and mirrors in "codex built codex 5.3" marketing was meh.

Training FLOPs and infrastructure will show up at the org level over time.

Liberty

Good post

Ilia Karelin

That’s so true. The only way to truly measure the capabilities of this new year's models will be to test them yourself and try out their different features.

Amazing breakdown Nathan! I need to try to play around with Codex maybe eventually!

Max Weimann

For me, Codex 5.3 works pretty well. What I really dislike is Claude’s tendency to do more than I ever asked for; instruction following remains Claude’s biggest weakness.

I’ve been using OpenCode, and without Hooks it always seems to forget what it's supposed to remember after a short timeframe. GPT/Codex is far better at following instructions over a long period. You described prompts like "create a branch and push it," but my workflow is very different; I would never write a prompt like that. I plan on using "Superpowers", which I’ve customized to use "beads."

Using this method, I’ve managed to build out very rich features (epics) on a relatively large codebase of around 1 million LOC. GPT 5.3 Codex and GPT 5.2 are significantly better at following ⁠AGENTS.md and handling skills and beads tasks. A huge plus is that I can always start a fresh conversation because the context doesn’t need to live in a single session or chat; it’s a massive improvement. I do agree that GPT 5.2 Codex and earlier versions weren't great and were definitely weaker than the current 5.3.

Nathan Lambert

IMO Claude is like Apple: best in its first-party app and not prioritizing partners.

Max Weimann

Another thing I forgot to mention: Claude always seems to consume and burn through more tokens. GPT is much more efficient at managing the context window itself.

Juan

What is your usual smattering of tasks that you used these latest models for?

Joseph

Opus is way more expensive. It’s never been worth it for me. Even Trae editor removed it from their model list. They’re falling way behind on pricing. Sure, it used to be “the best” model for coding, but the problem is the price.

Kinder Grinder

If you use it via Claude Code, it's way cheaper than using it with Cursor, for example.

Manjari Narayan

Thanks for sharing your experiences working with these models. As someone doing a lot of independent AI in science evals, it is reassuring to understand what it is like to be pushing models at the edge of their abilities and having to keep up with change every few months.

Practical AI Brief

'Post-benchmark' is the right framing: tool-fit > leaderboard. The usability angle is what most comparisons miss.

TokioJack

No benchmarks means frontier models can plateau for years, and investors and users won't be able to track anything.

Rangachari Anand

I am constantly reminded of the famous "no moats" memo that somebody circulated inside Google back in 2023. I think any transient advantage that one lab has will soon be competed away. It's not clear to me how anyone actually makes any money on this.

Nathan Lambert

That doesn’t seem like the conclusion at all? These services are immensely valuable and they’re making a lot of money. I’d pay more than I do if I had to.

Rangachari Anand

My point is that these models are becoming progressively more undifferentiated. If someone offered a competitive model at a lower cost, you would likely switch to that quickly. Unlike, say, a social network, these models are not very sticky (except for the inertia of making the change).

Nathan Lambert

For many use cases changing models is actually a huge pain. Most companies building with APIs find something that works, then never want to touch it.

Kinder Grinder

I've been feeling this exact shift lately. The benchmark numbers for these new models have basically become noise at this point. I don't care if Codex 5.3 scores a few points higher on some synthetic eval if I still have to manually double-check every git command or file move it makes.

It really comes down to the developer experience. Anthropic seems to understand that a model being a "good coworker" is more valuable than it being a "genius that breaks the build." I was looking at the breakdown for Opus 4.6 https://automatio.ai/id/models/claude-opus-4-6 and it’s pretty clear they’re leaning into that agentic flow where it just handles the mundane stuff without needing a 500-word prompt to stay on track. OpenAI is still great for hunting down a specific obscure bug, but for actually shipping features without the constant babysitting, Claude is still the one I’d trust more for general dev work. The gap in product feel is way bigger than the gap in raw logic right now.

Eric

Do you have any thoughts on what kind of stack makes it easier to do agentic work, and how the stack should be modified to be more agent friendly?

Nathan Lambert

Mostly, it’s hard to always be thinking about what your agents are doing. It’s a mental tax. Need to find ways to get them to be more independent, but I don’t have much to add over my previous post https://www.interconnects.ai/p/get-good-at-agents