Discussion about this post

David Dabney

Thanks for the overview! I especially appreciated the reflection at the end. I think this provides more value to your average reader than a straightforward roundup post; increasingly the models can do that for us lol

Pawel Jozefiak

The post-benchmark thing resonates. I ran an experiment last week where I gave four Opus 4.6 agents the same task with zero coordination, then had a fifth agent synthesize the results. No benchmark could have predicted what happened: the agents developed completely different approaches, and the synthesis was better than any individual output.

It's the same pattern I found when stress-testing multi-agent workflows (wrote it up: https://thoughts.jock.pl/p/opus-4-6-agent-experiment-2026): the gap between 'benchmark performance' and 'real-world emergent behavior' is widening, not shrinking.
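
A minimal sketch of the fan-out-and-synthesize pattern described above, assuming a hypothetical `run_agent` helper in place of whatever SDK or CLI call actually drives each agent:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(agent_id: int, task: str) -> str:
    # Hypothetical placeholder: swap in a real model/agent invocation here.
    return f"[agent {agent_id}] draft solution for: {task}"

def fan_out_and_synthesize(task: str, n_agents: int = 4) -> str:
    # Give the same task to n independent agents with zero coordination.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        drafts = list(pool.map(lambda i: run_agent(i, task), range(n_agents)))

    # A final agent sees every draft and produces a single synthesis.
    synthesis_prompt = (
        "Synthesize the strongest single answer from these independent drafts:\n\n"
        + "\n\n---\n\n".join(drafts)
    )
    return run_agent(n_agents, synthesis_prompt)

if __name__ == "__main__":
    print(fan_out_and_synthesize("Design a rate limiter for a public API"))
```

The point of the structure is that the fan-out stage is deliberately uncoordinated, so the divergence between drafts is what the synthesis step gets to exploit.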

Hot take: We'll stop comparing models within a year. The interesting comparison will be model + orchestration layer + tool access. Raw model capability is becoming table stakes.
