Discussion about this post

juluc

i commented on your continual learning piece back in august about running multiple claude code instances, and now, given the context of your post, i want to share what i've built since then, because i think it relates. disclaimer: i developed this for workflows i encounter frequently in my work, so it's shaped by that and basically evolves every day. to start: it's not about "fully switching" to another model when you're stuck. it's about building a system where opus first establishes a context-aware intent, then orchestrates calls to specialized proprietary tools or to other model providers' apis while maintaining full context itself. each external call is clean and isolated and, in a curated way, adds to the given context of the state of a convo. if you've read about recursive language models, it's kind of similar to that, just more hand-held. one main thing i've learned: opus needs to know! it's the main expert, the shot-caller. the external models are called for (a) facts, where they just retrieve information or state what they see, or (b) as "independent consultants" operating in isolated contexts (more on that below); their opinions may or may not be relevant or useful, and opus (after i make sure it's intent-aligned) decides what to actually use.

so let me explain what i mean. my work is research-, documentation- and communications-heavy. like for everybody right now, claude opus via claude code is my main interface. opus is amazing at capturing signal (or intent) and working agentically and coherently on longer-running tasks, but it needs to know what to use to accomplish a given task, and it needs to be reminded not to one-shot things but to work sequentially through tasks by calling external tools. by now, at the latest since the ralph wiggum loop blow-up, that's common knowledge. so whenever opus needs something it can't do well (or at all), like deep web research, transcribing a voice memo, or analyzing a pdf visually, my system has skills defined which describe or call tools, and opus then shells out to proprietary tools i built or to external model apis. these are simple python wrappers for gpt and gemini that claude calls like any cli tool.

the key things therefore are intent-alignment (which people do via planning mode or spec-driven development), context-surfacing (curating claude.md, skill definitions, hooks), context-isolation (subagents, other model api calls) and calibration (mostly a mix of skill definitions and claude.md). one thing i've learned about intent-alignment: at the start of a session, don't let opus give you a 500-word synthesis of the current state. align on intent fast, then bounce back and forth in shorter iterations. i call this "high signal" mode: information-dense, no fluff. this matters because when external model opinions come in, opus needs a strong anchor on what i actually want before it starts integrating them.

each project starts with a signal—could be a voice memo, a meeting, a forwarded chat. i process it via skills (transcribe, search emails), then run discovery via subagents to find what i've already done on this topic. files accumulate as i work; each project folder gets a CLAUDE.md with curated context. when sessions run long, a handover skill creates state files for the next session. so before opus calls any external api, it already knows what's going on.
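the handover part is the easiest to sketch. a minimal version, assuming a json state file and made-up field names (the actual schema is surely richer), could look like:

```python
import json
import time
from pathlib import Path

def write_handover(project_dir, summary, open_tasks, files_touched):
    """Write a state file that the next session reads on startup.

    Field names here are illustrative guesses, not the actual schema.
    """
    state = {
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "summary": summary,
        "open_tasks": open_tasks,
        "files_touched": files_touched,
    }
    path = Path(project_dir) / "handover-state.json"
    path.write_text(json.dumps(state, indent=2))
    return path
```

the next session's startup then only has to read one small file instead of re-deriving state from the whole folder.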

so on a basic level, this is how i call external tools or apis. the key is CLI arguments. when opus needs to call out, its internal state (it has listed relevant files, read some fully and some partially, gotten an index of potentially relevant files from subagents) lets it decide agentically which files are relevant for this isolated research task. for gpt it looks like: `--file notes.md --file state.md --file mail-chain.md --task "research X"`. the script stuffs these into the api call with xml markers so gpt knows there's a main task (the anchor) and context files (clearly named and ordered), returns the result on stdout, and claude reads it, decides what's actually useful given the original intent, and continues. the external model gets a clean isolated slice; it doesn't need conversation history because claude curated exactly what it needs.
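a sketch of what such a wrapper's argument handling and prompt assembly might look like. the marker tag names and the `build_prompt`/`main` helpers are my invention, and the real wrapper presumably also makes the actual api call where the comment below sits:

```python
import argparse
from pathlib import Path

def build_prompt(task, files):
    """Wrap the anchor task and each context file in xml-style markers
    so the external model can tell the task apart from the context."""
    parts = [f"<task>\n{task}\n</task>"]
    for name in files:
        body = Path(name).read_text()
        parts.append(f'<context name="{name}">\n{body}\n</context>')
    return "\n\n".join(parts)

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="call an external model with curated context")
    parser.add_argument("--file", action="append", default=[], dest="files")
    parser.add_argument("--task", required=True)
    args = parser.parse_args(argv)
    # a real wrapper would send the prompt to the model api here;
    # printing the result to stdout is what lets claude read it back
    print(build_prompt(args.task, args.files))
```

the `--file` flag uses `action="append"`, which is what makes the repeated `--file a --file b` style from the example above work.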

the models have different strengths. gpt-5 always has web search, so i use it for anything needing current information—market research, fact-checking, finding docs. gemini is better for multimodal (pdfs, images, audio transcription). the wrappers have presets: for gpt it's reasoning effort (`light`/`balanced`/`deep`), for gemini it's model selection plus thinking-level. most queries use `light`—quick 1-minute lookups without even attaching context files.
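the presets can be as simple as a lookup table inside each wrapper. the preset names come from the workflow above; everything mapped to them (parameter names, model ids) is my placeholder guess, not the real config:

```python
# preset names are from the workflow described above; the mapped values
# (parameter names, model ids) are placeholders, not the real config
GPT_PRESETS = {
    "light":    {"reasoning_effort": "low"},
    "balanced": {"reasoning_effort": "medium"},
    "deep":     {"reasoning_effort": "high"},
}

GEMINI_PRESETS = {
    "light": {"model": "gemini-flash", "thinking_level": "low"},
    "deep":  {"model": "gemini-pro", "thinking_level": "high"},
}

def resolve_preset(provider, name="light"):
    """Map a short preset name to concrete api parameters."""
    table = {"gpt": GPT_PRESETS, "gemini": GEMINI_PRESETS}[provider]
    return table[name]
```

defaulting to `light` matches the point that most queries are quick lookups; the caller only pays for `deep` when it asks for it.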

a workflow i use constantly: voice memos while walking, transcribed via gemini, then project discovery spawns parallel subagents to map the workspace and find what i've already done. half the time it surfaces useful state from weeks ago that i'd forgotten. the system acts as external memory.

what i've been leaning on lately is hooks that log invocations of skills and subagents. i log every skill invocation to a jsonl file (timestamp, skill name, args, session id). immediately after each skill, a hook calls haiku (basically free via the claude agent sdk) to infer the purpose of that invocation from the conversation context. then at session end, another hook feeds the entire transcript to gemini 3 flash and asks it to assess whether each skill actually helped, what the user response was, and whether the task progressed. the assessments get written back to the jsonl so i can query them later and improve the skills based on the semantic patterns observed. after a few hundred sessions, heuristics accumulate. parallel searches with different scopes catch things single searches miss. the system builds patterns from its own usage data and i can make my skills and subagents better.
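the logging half of that loop is straightforward. a minimal sketch, assuming one json object per line and made-up field names:

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("skill-log.jsonl")

def log_invocation(skill, args, session_id):
    """Append one record per skill invocation.

    `purpose` and `assessment` start empty; in the setup described
    above they would be filled in later by the haiku hook and the
    session-end transcript review. Field names are illustrative.
    """
    record = {
        "ts": time.time(),
        "skill": skill,
        "args": args,
        "session": session_id,
        "purpose": None,
        "assessment": None,
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

jsonl is a good fit here because the hook only ever appends, and querying later is just reading the file line by line and filtering.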

i think the interesting thing here is not the multi-model part per se but the architecture as a whole: opus as the main expert, external models as consultants that get clean isolated calls, and a feedback loop that tracks what actually works. intent-aligned opus decides what to use from the external opinions, sometimes everything, sometimes nothing. claude cowork will probably absorb some of this, but there's still a lot of value in building your own stack because the models are so jagged.

Michael Power

In a much-recommended comment on last Friday's Financial Times article entitled "DeepSeek rival’s shares double in debut as Chinese AI companies rush to list" (https://www.ft.com/content/a4fc6106-5a61-4a89-9400-c17c87fb1920#comments-anchor), I wrote the following:

You fundamentally misunderstand the emerging character of the Chinese LLM community. It is not so much competitive as 'co-opetitive'. Being Open Weight, they share architectural software improvements willingly whilst each individual LLM concentrates on a slightly different - yet complementary - area of expertise. What is emerging is a Dragon Swarm whose watchword is consilience. DeepSeek is the Architect Dragon whose Open-Weight 'foundation model excellence' (rich in software design features willingly shared) will be massively reinforced when R2 drops mid-February, not coincidentally coinciding with the advent of the Year of the Fire Horse. DeepSeek is the bedrock of the swarm - the 'Mother of Dragons' if you will. Aside from being the technical supremo, it is optimized for all-round reasoning and general intelligence. MiniMax is the Creative & Sonic Dragon, a specialist in multimodal creativity – text, voice, music and immersive content synthesis. DeepSeek and MiniMax (and Qwen, Kimi, Ubiquant, 01.AI, ZiAI, SenseTime and more) are not so much rivals as members of a Dragon Swarm of Open Weight LLMs covering an extraordinarily wide range of areas of expertise.
