i commented on your continual learning piece back in august about running multiple claude code instances, and given the context of your post, i want to share what i've built since then, because i think it relates. disclaimer: i developed this for workflows i encounter frequently in my work, so it's shaped by that and basically evolves every day.

to start: it's not about "fully switching" to another model when you're stuck. it's about building a system where opus first establishes context-aware intent, then orchestrates calls to specialized proprietary tools or other model providers' apis while maintaining full context itself. each external call is clean and isolated, and its result is folded back into the conversation state in a curated way. if you've read about recursive language models, it's similar to that, but more hand-held.

so one main thing i've learned: opus needs to know! it's the main expert, the shot-caller. the external models are called either (a) for facts, where they just retrieve information or state what they see, or (b) as "independent consultants" operating in isolated contexts (see what i mean below). their opinions may or may not be relevant or useful, and opus (after i make sure it's intent-aligned) decides what to actually use.
so let me explain what i mean. my work is research-, documentation- and communications-heavy. like for everybody right now, claude opus via claude code is my main interface. opus is amazing at capturing signal (or intent) and working agentically and coherently on longer-running tasks, but it needs to know what to use to accomplish a given task, and it needs to be reminded not to one-shot things but to work sequentially through tasks by calling external tools. this has been common knowledge at least since the ralph wiggum loop blow-up. so whenever opus needs something it can't do well (or at all), like deep web research, transcribing a voice memo, or analyzing a pdf visually, i have skills defined in my system that describe or call tools, so opus shells out to proprietary tools i built or to external model apis. these are simple python wrappers for gpt and gemini that claude calls like any cli tool.
the key things therefore are intent-alignment (which people do via planning mode or spec-driven development), context-surfacing (curating claude.md, skill definitions, hooks), context-isolation (subagents, other model api calls) and calibration (mostly via a mix of skill definitions, claude.md). one thing i've learned about intent-alignment: at the start of a session, don't let opus give you a 500-word synthesis of the current state. align on intent fast, then bounce back and forth with shorter iterations. i call this "high signal" mode, information-dense, no fluff. this matters because when external model opinions come in, opus needs a strong anchor on what i actually want before it starts integrating those opinions.
each project starts with a signal—could be a voice memo, a meeting, a forwarded chat. i process it via skills (transcribe, search emails), then run discovery via subagents to find what i've already done on this topic. files accumulate as i work; each project folder gets a CLAUDE.md with curated context. when sessions run long, a handover skill creates state files for the next session. so before opus calls any external api, it already knows what's going on.
so on a basic level, this is how i call external tools or apis. the key is CLI arguments. when opus needs to call out, its internal state (it has listed relevant files, read some fully, some partially, gotten an index of potentially relevant files from subagents) lets it decide agentically which files are relevant for this isolated research task. for gpt it looks like: `--file notes.md --file state.md --file mail-chain.md --task "research X"`. the script stuffs these into the api call with xml markers so gpt knows there's a main task (the anchor) and context files (clearly named and hierarchically ordered), returns the result on stdout, and claude reads it, decides what's actually useful given the original intent, and continues. the external model gets a clean, isolated slice; it doesn't need conversation history because claude curated exactly what it needs.
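a minimal sketch of what such a wrapper could look like (the xml tag names, function names, and flag handling here are illustrative placeholders, not the real script, and the actual provider api call is omitted):

```python
import argparse
from pathlib import Path


def build_prompt(task: str, files: list[tuple[str, str]]) -> str:
    """Wrap the anchor task and the curated context files in xml
    markers so the external model can tell them apart."""
    parts = [f"<task>{task}</task>"]
    for name, text in files:
        parts.append(f'<context file="{name}">\n{text}\n</context>')
    return "\n".join(parts)


def main() -> None:
    parser = argparse.ArgumentParser(description="external-model wrapper sketch")
    # repeated --file flags accumulate into a list, as in the example call
    parser.add_argument("--file", action="append", default=[], dest="files")
    parser.add_argument("--task", required=True)
    args = parser.parse_args()

    contexts = [(f, Path(f).read_text()) for f in args.files]
    prompt = build_prompt(args.task, contexts)
    # a real wrapper would send `prompt` to the provider's api here and
    # print the model's answer to stdout for claude to read; this sketch
    # just prints the assembled prompt
    print(prompt)

# the real script would call main() under an entry-point guard
```

claude then reads stdout like any other cli output, which is the whole trick: no special integration, just text in, text out.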
the models have different strengths. gpt-5 always has web search, so i use it for anything needing current information—market research, fact-checking, finding docs. gemini is better for multimodal (pdfs, images, audio transcription). the wrappers have presets: for gpt it's reasoning effort (`light`/`balanced`/`deep`), for gemini it's model selection plus thinking-level. most queries use `light`—quick 1-minute lookups without even attaching context files.
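the preset idea could be sketched as a small lookup table (the parameter names and model labels below are hypothetical placeholders, not the providers' actual api fields):

```python
# hypothetical preset table mapping short names to call parameters;
# field names here are illustrative, not real api fields
PRESETS = {
    "gpt": {
        "light":    {"reasoning_effort": "low"},
        "balanced": {"reasoning_effort": "medium"},
        "deep":     {"reasoning_effort": "high"},
    },
    "gemini": {
        "light":    {"model": "flash", "thinking_level": "low"},
        "balanced": {"model": "flash", "thinking_level": "high"},
        "deep":     {"model": "pro",   "thinking_level": "high"},
    },
}


def resolve_preset(provider: str, preset: str = "light") -> dict:
    """Default to the cheap fast preset; the wrapper merges the
    returned parameters into the api call."""
    return PRESETS[provider][preset]
```

defaulting to `light` keeps the common case (quick lookups without context files) one flag shorter.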
a workflow i use constantly: voice memos while walking, transcribed via gemini, then project discovery spawns parallel subagents to map the workspace and find what i've already done. half the time it surfaces useful state from weeks ago that i'd forgotten. the system acts as external memory.
what i've been harnessing lately is hooks that log invocations of skills and subagents. i log every skill invocation to a jsonl file (timestamp, skill name, args, session id). immediately after each skill, a hook fires that calls haiku (basically free via the claude agent sdk) to infer the purpose of that invocation from the conversation context. then at session end, another hook feeds the entire transcript to gemini 3 flash and asks it to assess whether each skill actually helped, what the user response was, and whether the task progressed. the assessments get written back to the jsonl so i can query them later and improve the skills based on the semantic patterns observed. after a few hundred sessions, heuristics accumulate: parallel searches with different scopes catch things single searches miss. the system builds patterns from its own usage data, and i can make my skills and subagents better.
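the logging step of such a hook is tiny; a sketch, with illustrative function and field names (not the actual implementation):

```python
import json
import time


def log_invocation(log_path: str, skill: str, args: list, session_id: str) -> dict:
    """Append one jsonl record per skill invocation: timestamp, skill
    name, args, session id. Later hooks can append assessments by
    writing further records keyed on the same session id."""
    record = {
        "ts": time.time(),
        "skill": skill,
        "args": args,
        "session_id": session_id,
    }
    # jsonl: one self-contained json object per line, append-only
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

append-only jsonl is the point: every hook can write without coordination, and querying later is just reading lines and filtering.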
i think the interesting thing here is not the multi-model part per se but the architecture as a whole: opus as the main expert, external models as consultants that get clean isolated calls, and a feedback loop that tracks what actually works. intent-aligned opus decides what to use from the external opinions, sometimes everything, sometimes nothing. claude cowork will probably absorb some of this, but there's still a lot of value in building your own stack because the models are so jagged.
In a much-recommended comment on an article in last Friday's Financial Times, "DeepSeek rival's shares double in debut as Chinese AI companies rush to list" (https://www.ft.com/content/a4fc6106-5a61-4a89-9400-c17c87fb1920#comments-anchor), I wrote the following:
You fundamentally misunderstand the emerging character of the Chinese LLM community. It is not so much competitive as 'co-opetitive'. Being Open Weight, they share architectural software improvements willingly whilst each individual LLM concentrates on a slightly different - yet complementary - area of expertise. What is emerging is a Dragon Swarm whose watchword is consilience. DeepSeek is the Architect Dragon whose Open-Weight 'foundation model excellence' (rich in software design features willingly shared) will be massively reinforced when R2 drops in mid-February, not coincidentally coinciding with the advent of the Year of the Fire Horse. DeepSeek is the bedrock of the swarm - the 'Mother of Dragons', if you will. Aside from being the technical supremo, it is optimized for all-round reasoning and general intelligence. MiniMax is the Creative & Sonic Dragon, a specialist in multimodal creativity - text, voice, music and immersive content synthesis. DeepSeek and MiniMax (and Qwen, Kimi, Ubiquant, 01.AI, ZiAI, SenseTime and more) are not so much rivals as members of a Dragon Swarm of Open Weight LLMs covering an extraordinarily wide range of areas of expertise.
I feel like the idea is presented a bit more strongly than I'd put it, but yes, I agree, and I think what's interesting about the Chinese ecosystem is its overall dynamic, which makes up for the models being slightly less good in absolute terms.
Nathan, you need to check out DeepSeek's latest paper, which dropped this morning. The title is: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. It has radical implications for what memory is, what it can be, and how it can be structured. I see it as but the next REVEAL in the build-up to V4... and possibly even the release of R2.
For now... I agree. But after MLA - especially if soon complemented by V4, and perhaps soon thereafter all 'wrapped' in R2 (with extra features likely added then) - can you be so sure?
The open models make the overall pace of progress higher, which benefits open model builders most (who happen to be behind). If MLA helps, closed labs take it too. I see DeepSeek as the open lab with the best track record of innovation, and their architecture did a lot to start the wave in 2025, but in the long term it may not look like a repeating cycle.
I don't doubt the closed labs will be all over MLA. V4 too, when it drops. And any of the 'extras' in R2... My point - and you have made this observation regarding OS LLMs better than anyone else - is that with MLA 'reinventing' training (compressing attention memory by 93% during training), plus all the architectural elegance now evident on the inference side, the floodgates for wide adoption of the OS models are about to open... I simply do not see the old-order 'brute force' CW models holding the trump card they have held in the past (and I think Nvidia knows this: Nemotron 3; Groq!). I am not hypnotized by benchmarks (especially with Yann LeCun revealing how Meta gamed the system!), but I think that R2 - when it drops - is going to unleash a wave of upgrades (CW as well as OW). But more importantly, the barriers to constructing great LLMs will be lowered... FOR EVERYONE!
I'd love for you to be right. I'm currently slightly more bearish on DeepSeek's capabilities. I think more people have caught up and it's hard to get ahead.
Let's speak again on 18 February... the day after the Chinese New Year begins and the Year of the Fire Horse starts. I don't think even the techies at DeepSeek will be able to resist using the start of BY FAR the most powerful year in the 60-year Chinese Zodiac to make a statement!
I do not say YOU misunderstand this. Not at all! What I said was that THE WRITERS OF THE FT ARTICLE misunderstand what is happening. I suppose my comment merely wanted to suggest that were you to access DeepSeek after R2, I doubt you would encounter any 'jaggedness' between swarm members. One for All; All for One. So you would not need to use multiple closed-weight LLMs to serve your needs. And there will be no fees involved either.
(Yeah I read it quickly first thing in the morning and misunderstood sorry, I deleted it)
:)
This is the most practical breakdown of multi-model workflows I've seen. Your observation that switching models 'regularly solves the task' is the key insight - it means we're at a capability frontier where each model has high probability of success, just with different failure modes.
Your stack matches mine almost exactly: GPT Thinking for research verification, Claude Opus for code and creative feedback, Gemini for multimodal work. The jaggedness of capabilities makes mono-model workflows feel increasingly limiting.
The workflow orchestration layer becomes crucial at scale. When coordinating multiple models across tasks, having a system that remembers which model works best for what is where the real productivity gains come from.
I built LLMatcher to help find the right model for different tasks: https://thoughts.jock.pl/p/llmatcher-update-personal-ai-discovery
As a non-tech idiot who regularly gets in way over my head on vibecoding projects, I've developed a clunky method of consulting multiple models when the one I'm working with gets stuck or seems off base. I ask it to write up a memo describing the bug or strategic question or whatever and paste that into the other two of the ChatGPT, Claude, Gemini "Crew" and into a new instance of whatever one I'm working with (with the instruction that it's a naive model who should ignore any context that it comes across). Then I share the results with the model I've been working with -- and sometimes "fire" it and switch to working with another!
Is there a better way of doing the same thing? Meaning either likely to get better results or to take less time/effort. A lot of the time the advice is great, especially on bugs. The most maddening thing is that models won't tell me about available better solutions that I don't know to ask about.
Love your appearances with Jordan et al. Thanks!
Honestly, that sounds about right. There are ways you can use the API versions and set them up to share context, but that's a bit of effort and I haven't even tried it yet.
Thanks!
Fascinating article! I'm at the other end of the spectrum: as a retired AI researcher, I don't use *any* AI models in my normal life and am much the happier for it :-)
thanks for sharing. Your approach is similar to what many other heavy AI users do (multiple models for different tasks), and it tells a story in itself: one model to fit them all does not work.
I would add Qwen Chat to the stack: even in the free tier, the way it handles conversation history, memories and user interaction is unique. And the image generation is amazing.
Recently I discovered Perplexity (I have a free Pro subscription through Revolut): the way it grounds every statement in links and references is reassuring. And for coding (in Python) it does a surprisingly good job!
Love this! Here's what I did to make this a bit easier for myself:
https://counsel.getmason.io/
https://github.com/mercurialsolo/counsel-mcp
An open-source multi-council MCP that can work with multiple models and get them to debate each other.
Helpful, but this feels more like your cognitive model for routing than a "stack". I was intrigued to learn how you're implementing context engineering, workflow orchestration, building evals, red teaming, etc.!
Most of it is remarkably simple and yolo; I'm in the early days of figuring one out for Claude Code.
Hah, I wondered if that might be it. That still seems to be the most common answer unless you're at one of the FDE + platform companies trying to sell the platform (Dystil, Invisible, Crew, etc.)