I am self-employed, working on various consulting-like projects - all my files live in a repo and I have 6 screens and 10+ Claude Code instances open at all times. I built a basic harness / agent framework that knows how to soak up compute for @project_x, then I just use my own expertise or my network to judge what the next best action is to get a signal for how to progress on a given project. All my communication with this system is via voice - I sometimes record myself for an hour or so talking about my ideas on project_x, then I consult someone I know who knows better to verify my ideas and rapidly iterate to get to a state where I can draft the next pitch / mail / WhatsApp / call, whatever. I am now working on 10+ projects while building my system out further, and through these rapid actions and interactions and the genuinely institutional-grade work I do, my network expands as well (-> more context, more projects, more systems).
Continual learning will just be building systems that soak up compute and scale with search - you can make your system intelligent enough to make connections you can't even see yourself. I am sometimes in awe when Claude Code refers to an unrelated project and maps it onto my current one. And this is happening NOW. If you have access to experts, are AI-native and know how to start things, go for this too. Even if this doesn't work, you'll learn so much that you'll be tier-A talent by the time these systems are built not on Claude Code but natively by the big labs themselves.
One thought I have been having lately is whether I am not indirectly also building a super powerful RL env for labs to buy lmao (I have a lot of idiosyncratic data through the nature of my work). Maybe you can answer that @nathan
So much excitement for this idea across folks. I'm going to try and build a repo for this that people can use to experiment with Claude Code over interconnects :)
More soon.
This is so cool - if I can help refine your v0.1, feel free to reach out!
Here's a template for a voice first second brain I put together a couple of weeks ago in make.com, a lot of branches could really use a proper general purpose agent rather than just a basic AI system call https://open.substack.com/pub/metacircuits/p/how-i-built-my-second-brain-in-3
I have no idea who will buy stuff but there's a lot of money flowing around.
Love this.
Just a thought I had when I listened to Edwin Chen from Surge AI describe one of the envs they create synthetically and curate.
But to add to that, I think human work will just change to being a converter of signals: having LLMs identify these signals and then pursuing actions based on your own intuition or experts' guidance.
Wow, this is a really cool idea. Do you have all of these projects living in the same repo? I’m sort of imagining a setup where you have one folder per project, and each project folder has its own CLAUDE.md file to describe how the agent in that particular project should operate. Then you could also have a CLAUDE.md file in the main, overarching folder and run a Claude Code instance there to surface connections between your different projects. Is that roughly on the right track?
I do have basic CLAUDE.md files, but they are just to embed the core operational best practices - I have other agents / commands (HUGE monoliths, super carefully curated) that handle the CLAUDE.md stuff. CLAUDE.md is mostly just stuff that corrects my bash commands so they don't fail 2x before using the right approach haha. And again, the core principles are: semantic richness, and VERY limited interpretation or inference (example: "oh, your mail mentioned a project worth 100k, so you have 100k if X, Y and Z happen" -> NOT TRUE; i.e., context pollution is your WORST enemy). Most important thing: see LLMs as your CPUs that run your "raw files" based on your prompt / instruction (in my case often 20-60m audios) and only THEN infer or interpret based on my instructions. Another key principle is therefore preservation of raw context - every "oh wait, I actually meant" is valuable and gives nuance to your context.
I see ICL as a rank-1 weight update - the stronger you can pull the context towards a semantic region (i.e., eliciting the right features and circuits) so it lands and stays there, the lower your effort in steering and correcting drift. That's why all system prompts are so large, and so should any agent calibration be - for that you of course need to understand what the agent needs to do, and for that you need projects / ideas / experiments.
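One standard way to make the rank-1 intuition concrete is the linear-attention gloss below - a simplification added here for illustration (real attention has a softmax and many heads, so the correspondence is only approximate):

```latex
% Linear-attention view of in-context learning (illustrative simplification).
% With key/value pairs (k_i, v_i) already in context, a query q reads out:
\[
\mathrm{out}(q) \;=\; \sum_{i=1}^{n} v_i \,(k_i^{\top} q)
\;=\; W_{\mathrm{eff}}\, q,
\qquad
W_{\mathrm{eff}} \;=\; \sum_{i=1}^{n} v_i k_i^{\top}.
\]
% Appending one more in-context pair (k_{n+1}, v_{n+1}) shifts the effective
% weights by a single rank-1 term, with no change to the trained parameters:
\[
W_{\mathrm{eff}} \;\leftarrow\; W_{\mathrm{eff}} \;+\; v_{n+1} k_{n+1}^{\top}.
\]
```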
Hope that helps, albeit not answering your question 1:1.
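For concreteness, here is a minimal sketch of the folder-per-project layout being discussed above - the project names and CLAUDE.md contents are hypothetical placeholders, not anyone's actual setup:

```python
# Minimal sketch: one folder per project, each with its own CLAUDE.md,
# plus a root CLAUDE.md for cross-project runs. Names are placeholders.
from pathlib import Path

PROJECTS = ["project_x", "project_y"]  # hypothetical project names

root = Path("workspace")
root.mkdir(parents=True, exist_ok=True)
(root / "CLAUDE.md").write_text(
    "# Global instructions\n"
    "Surface connections between projects; keep interpretation minimal.\n"
)

for name in PROJECTS:
    proj = root / name
    (proj / "raw").mkdir(parents=True, exist_ok=True)  # raw transcripts, notes
    (proj / "CLAUDE.md").write_text(
        f"# {name} instructions\n"
        "Work only from the raw files in this folder; preserve raw context.\n"
    )
```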
This is really cool!
I iterated on this idea 10 months ago and ended up building custom “deep research” type agentic systems because it wasn't a consumer product yet
Maybe I should just be plugging Claude Code in everywhere instead
Like cloud vms running claude code wrapped in an MCP server or something - or is that a hotdog?
Have been doing this with a few projects where I'm advising as a product person. Have found that putting product requirement docs, meeting notes and other items into well structured git repos and then analysing / extending using a code copilot can work wonders.
Also building custom search/retrieve MCPs here, since the default indexing in code IDEs assumes code, whereas I have a lot of non-code content plus functional domain context.
Would love to collaborate with you, or just jam on ideas, if you'd find it helpful
All your different projects in one repo?
Wait wtf. Is Nathan scale pilled and dwarkesh not anymore??? Wild
Geoff Hinton offers a way to reconcile both of your positions. When asked recently about the most important problem in AI (other than safety), he emphasized that training and inference shouldn't be viewed as distinct categories; in his view, they're parts of the same continuous process. As Dwarkesh describes it, continual learning represents the training/fine-tuning aspect of weight adjustments and long-term model adaptation. Meanwhile, your use of "memory" aligns more with in-context learning, which focuses on inference using context and prompts. In Hinton's framework, "learning" spans both approaches, and as algorithms and architectures evolve, both fine-tuning and dynamic context-based learning will remain integral to truly intelligent systems. I am currently working on an interesting way to fuse gradient descent and context engineering/tool usage.
I believe AGI is a milestone that can only be labeled in hindsight.
Looking back, historians will be able to determine a point in time where it turns out to be just scaling up. That day in current time will just glide by as 'just another day'.
"The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning."
One way of framing this debate is: when we bolt a vector DB and RAG layer onto an LLM (a minimal sketch of that pattern is below), are we proposing a general algorithm that can be scaled up, or are we building knowledge into our agent?
It would have been nice to touch on the bitter lesson here.
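For reference, a minimal sketch of the "bolt-on" pattern in question - an external store with embedding-based retrieval feeding an LLM prompt. The embed() function is a placeholder standing in for a real embedding model; nothing here comes from the post itself:

```python
# Minimal external-memory + retrieval sketch (assumed design, not a real system).
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a deterministic fake embedding. A real system would call
    # an embedding model here instead.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

class MemoryStore:
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vecs: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vecs.append(embed(text))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        sims = np.array([v @ q for v in self.vecs])  # cosine sim (unit vectors)
        top = np.argsort(-sims)[:k]
        return [self.texts[i] for i in top]

# Usage: retrieved notes are simply prepended to the prompt; the model's
# weights never change, so any "learning" lives in the store and the context.
store = MemoryStore()
store.add("Meeting note: the pitch targets a 100k budget.")
store.add("Claude Code keeps failing on the same bash quoting issue.")
notes = "\n".join(store.retrieve("What did we decide about the budget?"))
prompt = f"Known notes:\n{notes}\n\nAnswer the user's question."
```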
My take would be that yes, what I'm suggesting is anti-bitter-lesson in some ways, at least in the long term, but practical about the compute situation.
Dwarkesh's proposal would be a better AI, but if it won't be here for 20 years, does it apply?
I asked GPT-5 about this intuition. I'm pasting the entire response because it is well balanced. From private sources I know that a major continuous-learning LLM architecture will be announced soon. I wonder how it will stack up against this.
——
You raise a very good point — it’s quite plausible that, in the long run, an integrated continual learning / memory module co-trained with the rest of the system will outperform a detached memory component that’s “bolted on” later. The “bitter lesson” suggests that general-purpose architectures plus scale tend to win over domain-specific hacks, so one might expect that the system as a whole — including memory — should emerge via large-scale training rather than manual modularization.
Let me unpack how I see the tradeoffs, and where I think your intuition has merit (and where challenges lie).
⸻
Recap of what Dwarkesh / Nathan argue
From the article, the key argument from Dwarkesh is that the lack of continual learning (i.e. models improving over time in situ) is a major limitation of existing LLM systems. Nathan responds by saying that:
• Rather than building human-like perpetual learners, we’re better off scaling context, memory, and retrieval systems so that the model feels like it’s learning continuously.
• The “memory / context” component is a systems problem (how you store, retrieve, and feed in past information), not necessarily a new learning algorithm per se. 
• Increasing context length, better retrieval, chaining of LLMs, etc., can approximate what people mean by continual learning in many use cases. 
• The “bottleneck” idea of continual learning may be overemphasized — we may just iterate systems that look like they adapt.
So Nathan’s view is: memory + retrieval + context scaling is a more realistic near-term path; fully agentic continual learning might come later or differently.
⸻
Why an integrated memory + learning system might outperform a modular memory
Here’s where your intuition is strong: in principle, there is advantage in joint optimization of memory and the rest of the model:
1. Representational Synergy
If the memory (or long-term state) is part of the same optimization pipeline, the model can shape representations with memory in mind, rather than having to adapt to a fixed memory interface. The memory could store latent embeddings, residuals, or compressed codes that are most helpful for future prediction, not just “raw past tokens.”
2. End-to-end gradients / feedback
If the memory is differentiable and participates in backprop, the system can learn which memories to store, how to compress them, and how to read them in a way that’s aligned with downstream task performance. This is more powerful than heuristic memory insertion or retrieval.
3. Dynamic capacity and plasticity
A unified system can allocate memory capacity dynamically (e.g. “learn more in some domains, forget in others”) as a function of how useful past information is for future tasks. A modular memory might be more rigid or conservative.
4. Avoiding module mismatch / interface engineering
When memory is modular, you must design interfaces: what gets stored, how it is retrieved, how it’s encoded/decoded, when to evict, etc. Misalignments or suboptimal heuristics can bottleneck performance. By contrast, if memory is integrated, the system can discover its own “interface.”
5. Better generalization and meta-learning
With an integrated memory and model, the learner might internalize meta-strategies — e.g. “I should store this kind of pattern because I’ll need it later.” Over many tasks, the system could learn memory operations (write, read, skip, compress) as part of its training. This might approach something like a learned continual learner.
Thus, an integrated memory-augmented model (e.g. something like end-to-end transformers with persistent states, or neural Turing machines, memory networks co-trained) has a theoretical advantage.
⸻
Why the modular or detached memory + retrieval approach is competitive (and hard to beat)
However, there are strong arguments why the modular memory + retrieval or “context + retrieval engine” paradigm is more practical short-to-medium term, and might even beat integrated systems in many real settings:
1. Scalability / engineering simplicity
Retrieval + memory modules are relatively easy to scale, optimize, and shard independently (e.g. vector databases, approximate nearest neighbor search, memory indexing). Integrating memory into the model end-to-end is more challenging in terms of compute, memory, and engineering complexity.
2. Data efficiency
Modular memory can store actual data points that the system can recall precisely (e.g. past conversations, documents). The model doesn’t need to memorize everything internally; it can rely on the memory store as external ground truth. This reduces the burden on the core model.
3. Modularity / replaceability
If memory is modular, it can be improved, replaced, or scaled without retraining the entire model. You can upgrade your memory (better indexing, caching, vector search) independently.
4. Catastrophic interference / stability
End-to-end continual learning is notoriously hard: new information can overwrite old patterns. Integrated systems must manage plasticity vs. stability, avoid forgetting, and balance update dynamics. Modular memory helps isolate the “storage” from the “learning core,” reducing interference.
5. Signal sparsity and supervision
Not all memory writes are useful; supervision for which memories matter is often sparse. It might be hard to learn, from scratch, a memory-write policy that is robust. Heuristics and retrieval-based systems provide “bootstraps.”
6. Compute constraints
If every memory operation is part of backprop, the compute cost is large. Practical systems may prefer memory to be “out-of-band” (like a database) so the main model can remain efficient.
7. Engineering maturity
The retrieval / memory + context paradigm is already being deployed (vector DBs, RAG, retrieval-augmented models). They’re battle-tested in real-world systems. Integrated continual learners, while alluring, are still experimental.
⸻
Where the “bitter lesson” comes in and how it aligns with your intuition
The bitter lesson (as Rich Sutton phrased it) is that methods which leverage scale and general-purpose architectures often dominate over hand-crafted domain-specific solutions. In this light:
• The memory / retrieval module is somewhat domain-specific. One might argue that the “pure scale + architecture” approach would prefer that memory emerges naturally inside the model through massive training, rather than being a distinct module with heuristics.
• Over many tasks and large scale, a system might implicitly learn to “remember” things in its weights (or structured latent states) rather than relying on an external memory. This would correspond to your view: that the “memory” becomes just another part of the unified network.
• If one trains gigantic models over streams of data, perhaps the memory function becomes emergent rather than explicit — i.e. the model internalizes the bookkeeping.
Thus, from the perspective of the bitter lesson, I’d lean toward your intuition: the long-run winner is likely a more fused system rather than a purely modular memory bolt-on, especially once scale and compute are huge.
That said, the intermediate regime (where modular memory + retrieval is used) is likely to persist for many years, because of the pragmatic reasons above.
⸻
My balanced view & predictions
Here’s where I land:
• In the near to mid-term (5 years or so), we will see hybrid systems: big LLM cores + external memory modules + clever retrieval and caching, with some gradient-based fine-tuning or adaptation. The memory module will not be black-box: it will be co-designed, and might even receive some gradient signals or learning, but not be fully integrated in all respects.
• Over the longer term (10+ years), as compute, data, and algorithms improve, more of the memory functionality may migrate inside the trained network: read-write heads, differentiable state, etc., so that the overall system is more monolithic.
• The “modular memory + retrieval” approach might still survive, but its role becomes more “fast cache / ground truth store,” while the core model retains a kind of internal memory over longer horizons.
So yes — I agree with you: the “bolt-on memory” approach is a valuable stepping stone, but the eventual winner is probably one where memory is native, learned together, and deeply integrated rather than grafted. The bitter lesson suggests you should not rely on manual interfaces forever if scale and generality are your goals.
If you like, I can try to sketch architectures or papers that are already exploring integrated memory & continual learning, and compare them to the modular memory + retrieval designs. Do you want me to do that?
It seems obvious that scaling retrieval + context + attention mechanics is a road to continual learning, with zero paradigm shifts needed.
I.e., an LLM that can efficiently persist, retrieve from, and attend to all past interactions is all you need.
This is the central argument, IMO. I disagree that context / memory engineering solves for continual learning.
When a human worker learns, he/she not only continually obtains knowledge, but also advances their approach to problems and their intuition.
This is the critical piece that separates “capabilities” from “AGI”. Systems are not just gaining knowledge - they are self-improving in their ability to gain knowledge.
Take a high school biology student vs a microbiology PhD candidate, same person, 10 years apart. The PhD candidate is far superior in his/her ability to make breakthroughs in the field. This is the same person!
Certainly, an LLM in 2025 begins with PhD-level knowledge, gained in a much shorter time of pretraining than 10 years of intensive schooling. However, can I count on the LLM to autonomously identify problem areas, perform original research and write a thesis, collaborate and build relationships with other researchers, present ideas, earn a reputation for unique problem approaches, make witty, nerdy jokes about bio research, and continue to mature its thinking patterns over time?
These are qualities that appear as general intelligence to me. Context and memory engineering will not achieve this.
I think what you’re describing is not only accumulating and attending to context but a model’s weights changing over time? Seems like fine tuning / RLing on frequently-retrieved important memories would achieve this.
Disagree.
Currently, fine-tuning amounts to modifying existing weights to change a model's behavior, such that some tokens get greater probabilities than they would in the base model. It gets better at a domain of specific prediction tasks, but doesn't generally improve.
A human student is quite adept at context-efficient learning. That isn't just about what we learn, but how we learn, and how the “approach to problems” itself gets better over time.
I do not see current RL / LoRA / DPO achieving that. It simply modifies an existing intelligence to behave in different ways.
Case in point, take GPT-5 or Claude Sonnet 4.1, some of the most advanced models. You can chat with it about a deep topic every day for a month, and it will recall some of your conversations, but it will perform no better than the off-the-shelf model.
You will not get a Dan Carlin with his ooomph in history lessons. You will not get a Stephen King with his wildly imaginative horror. You will not get a Weird Al Yankovic with a unique humor that persists for 40 years. These people were not born as such; through context-efficient learning, and some inherent genetic traits, they developed a sense of general intelligence and specialization.
I really doubt that the current memory features as implemented by ChatGPT/Claude will be enough. We as humans don’t do continual learning merely by writing down everything; there’s a ton of implicit information that we retain somewhat subconsciously. The memory features of ChatGPT/Claude are the equivalent of an amnesiac going through life relying on a very long diary.
I’m much more curious about something like a custom LoRA, which could in theory retain implicit/subconscious information as well as explicit information. Custom memories in latent space, in a way. But I’m not sure how far away we are from figuring that out.
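For reference, the standard LoRA parameterization that idea would build on (this is the generic formulation, not any lab's memory feature):

```latex
% LoRA: freeze the pretrained weight W_0 and learn a low-rank correction.
\[
W' \;=\; W_0 + \Delta W,
\qquad
\Delta W \;=\; \frac{\alpha}{r}\, B A,
\quad
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k).
\]
% A per-user "memory LoRA" would amount to continually updating A and B on
% that user's interactions while W_0 stays fixed.
```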
the point is that we need to scale up memory features to something way more computationally intensive.
Yes. Maybe it will turn out to be a simple compression problem…
I agree that LLMs are not right for continual learning. I disagree that LLMs are a form of AGI.
Here is an alternative, but it requires a lot of work from scratch - https://philpapers.org/rec/NAUNOC-2
Such an interesting idea - Continual Learning!
And an amazing guest with a lot of knowledge; amazing to hear such opinions and to see where LLMs progress later on.
This is kind of a technical question. Maybe we will move to fast vs. slow weights. Maybe we will have a KV-cache lookup, so the model can load "in-context memory", or maybe something else. I think we can agree that labs have not made a big attempt to solve it, the way they are for test-time RL. And if they do, chances seem good.
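As one concrete version of the fast-vs-slow-weights idea - a sketch in the spirit of fast-weight associative memories (cf. Ba et al., "Using Fast Weights to Attend to the Recent Past"); the exact form here is an assumption, not something from the post:

```latex
% Slow weights W are trained offline; a fast-weight matrix A(t) is updated
% online from recent hidden states h(t) and decays over time.
\[
A(t) \;=\; \lambda\, A(t-1) \;+\; \eta\, h(t)\, h(t)^{\top},
\qquad 0 < \lambda < 1,
\]
% The next hidden state is refined over a few inner "settling" steps s,
% so A(t) acts as a short-term associative memory layered on the slow weights:
\[
h_{s+1}(t+1) \;=\; f\!\big( W\, x(t+1) \;+\; A(t)\, h_{s}(t+1) \big).
\]
```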
The term ‘continuous learning’ really just means learning in the human sense of the word.
Both Chollet and Demis have talked about intelligence being the ability to learn how to learn, which is rooted in play (highly recommend reading ‘Homo Ludens’ by Johan Huizinga).
AGI has always been a confusing term, as it has been defined with humans at the heart of it: “as good or better than us in every single way”. But every single way is a lot of ways, and thus creating AGI has basically come to mean building a human being from scratch. It’s a hilariously naive endeavour, because we already have humans and we know how they’re made.
I like that you distinguish between the two and say that AI may not be doing human reasoning but is certainly capable of some kind of reasoning. Where your argument falls flat, for me at least, is that you seem to believe, without providing much evidence, that the next version or the next-next version will magically solve our problems, whereas others believe that some of the inherent flaws cannot be overcome despite our best attempts at scaffolding and providing more and more and more context. It could very well be that after finding out that scaling was not all we need, context will also not be all we need. It could be we are building castles on shaky foundations.
Microsoft Copilot has a bad reputation, and somewhat deservedly so. But the fact that it's hooked up to pretty much all your work content that is updated all the time (Teams, Outlook, SharePoint) means it can produce quite magical moments.
Something like this would enable continual learning:
Biblical-scientific moral ontological framework for GPT-5:
https://chatgpt.com/s/t_68a01563e5288191a5d2ff32598a9c09
Oops, I meant to post this on the other fellow’s post…
There are aspects of the human brain, like emotion and creativity, that AI may never be able to adequately replicate. So, if one means thinking and acting 100% human, we are a long way from achieving that, if ever. However, there are many other brain functions (problem solving, adaptability, common-sense reasoning) that I think are readily attainable, far beyond the ability of most current humans, because memory, speed of recall and common-sense reasoning are crucial factors which allow most humans to do 90% of what most of us do every day. I am not sure about abstract thinking (understanding and manipulating abstract concepts), however, which would be one of the prerequisites for AGI. If AI cannot "learn" something new or discover something currently unknown to humans (which is not in an LLM), then it has the potential for perhaps another 5% of what most humans do. The last 5% may be unattainable.
My Claude Projects get better over time - I simply add the newly created work and iterate on approaches.
Hi Nathan,
I feel like you missed Dwarkesh's point; it was not that LLMs have to learn the way humans do, it's that LLMs in their current form are unable to learn new data continuously. Our LLMs are doomed to not know anything beyond their cutoff date. All knowledge after the cutoff can be added to context, memory, a data connector, or any external file, but that's not solving continual learning, which is how to update the model weights from new conversations without catastrophic forgetting. In other words, how do we refresh the context every day but still remember essential pieces of yesterday's conversation?
I fully agree with you that external memory will be practically helpful and that scaling is the way to go, but so far we have zero examples where an LLM can learn useful information continuously without degrading its original performance. And without such an example, I don't understand how you can be sure that there will be an AGI that solves this problem.
> it's that LLMs in their current form are unable to learn new data continuously
This is what I'm saying we're going to get easily, at least behaviorally, by giving them more context and compute (and tools). We don't actually have to update the weights because the models can do in-context learning.