I am self-employed, working on various consulting-like projects. All my files live in a repo and I have 6 screens and 10+ Claude Code instances open at all times. I built a basic harness / agent framework that knows how to soak up compute for @project_x; then I use my own expertise or my network to judge the next best action to get a signal on how to progress a given project. All my communication with this system is via voice - I sometimes record myself for an hour or so talking through my ideas on project_x, then consult someone I know who knows better to verify them, and rapidly iterate until I can draft the next pitch / mail / WhatsApp / call, whatever. I am now working on 10+ projects while building my system out further, and through these rapid actions and interactions, and the genuinely institutional-grade work I do, my network expands as well (-> more context, more projects, more systems).
Continual learning will just be building systems that soak up compute and scale with search - you can make your system intelligent enough to make connections you can't even see yourself. I am sometimes in awe when Claude Code refers to an unrelated project and maps it onto my current one. And this is happening NOW. If you have access to experts, are AI-native, and know how to start things, go for this too. Even if it doesn't work, you'll learn so much that you'll be tier-A talent by the time these systems are built not on Claude Code but by the big labs themselves.
One thought I have been having lately is whether I am not also indirectly building a super powerful RL env for labs to buy lmao (I have a lot of idiosyncratic data through the nature of my work). Maybe you can answer that @nathan
I have no idea who will buy stuff but there's a lot of money flowing around.
Love this.
Just a thought I had when listening to Edwin Chen from Surge AI describe one of the envs they create synthetically and curate.
But to add to that, I think human work will just shift to being a converter of signals: having LLMs identify those signals, then pursuing actions based on your own intuition or experts' guidance.
I believe AGI is a milestone that can only be labeled in hindsight.
Looking back, historians will be able to pinpoint a moment where it turned out to be just scaling up. In the present, that day will simply glide by as 'just another day'.
It seems obvious that scaling retrieval + context + attention mechanics is a road to continual learning, zero paradigm shifts needed.
I.e., an LLM that can efficiently persist, retrieve from, and attend to all past interactions is all you need.
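For what it's worth, here is a minimal sketch of that persist/retrieve/attend loop. The toy hashing embedder and the `MemoryStore` name are placeholders I made up for illustration, not any product's actual memory implementation; a real system would use a learned embedding model and a vector database.

```python
# Minimal sketch: persist every interaction, retrieve the most relevant ones,
# and prepend them to the prompt so the model can attend to past history.
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy deterministic embedding: hash each token into a fixed-size vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

class MemoryStore:
    """Persists past interactions and retrieves the most relevant ones."""
    def __init__(self):
        self.texts: list[str] = []
        self.vecs: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vecs.append(embed(text))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        scores = [float(q @ v) for v in self.vecs]
        top = np.argsort(scores)[::-1][:k]
        return [self.texts[i] for i in top]

store = MemoryStore()
store.add("user prefers terse answers and TypeScript examples")
store.add("project_x deadline is end of quarter")
store.add("we decided against microservices last sprint")

# Retrieved memories get prepended to the prompt, so the model "attends"
# to relevant history without keeping everything in the context window.
context = "\n".join(store.retrieve("what did we decide about architecture?"))
print(context)
```

The question in this thread is whether this kind of external loop counts as learning, or whether it is just a very good filing cabinet.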
Geoff Hinton offers a way to reconcile both of your positions. When asked recently about the most important problem in AI (other than safety), he emphasized that training and inference shouldn't be viewed as distinct categories; in his view, they're parts of the same continuous process. As Dwarkesh describes it, continual learning represents the training/fine-tuning side: weight adjustments and long-term model adaptation. Meanwhile, your use of "memory" aligns more with in-context learning, which focuses on inference using context and prompts. In Hinton's framework, "learning" spans both approaches, and as algorithms and architectures evolve, both fine-tuning and dynamic context-based learning will remain integral to truly intelligent systems. I am currently working on an interesting way to fuse gradient descent and context engineering/tool usage.
Continual learning and memory are something so basic that not only humans but all brains in nature have them. An AGI without an explicit long-term memory mechanism is a pure illusion, because when exploring complex problems and expanding the frontiers of knowledge there is simply no way a system can keep everything in short-term memory, i.e. the context window. For example, even if one could conceive of a transformer architecture with unbounded short-term memory, the system would quickly run out of attention layers to process it. Long-term memory acts as an intelligent mechanism to retrieve the information that is needed at any given time, without saturating short-term memory. So it's not a matter of making airplanes like birds; it's a matter of recognizing the basic principles of aerodynamics that birds and airplanes alike must obey. In the case of computing, that principle is short-term vs. long-term memory, and it holds even for the von Neumann architecture, the basis of modern computers.
I really doubt that the current memory features as implemented by ChatGPT/Claude will be enough. We as humans don’t do continual learning merely by writing down everything; there’s a ton of implicit information that we retain somewhat subconsciously. The memory features of ChatGPT/Claude are the equivalent of an amnesiac going through life relying on a very long diary.
I'm much more curious about something like a custom LoRA, which could in theory retain implicit/subconscious information as well as explicit information. Custom memories in latent space, in a way. But I'm not sure how far away we are from figuring that out.
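To make the LoRA idea concrete, here's a rough sketch of the mechanism in PyTorch, assuming a single adapted linear layer; the `LoRALinear` name and hyperparameters are my own illustrative choices. The point is that only the small A/B matrices would be trained per user, so "memories" live as a low-rank weight delta rather than as text in a diary.

```python
# A minimal LoRA-style adapter sketch: a frozen base layer plus a trainable
# low-rank update that could, in principle, encode per-user adaptations.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path + low-rank "memory" path; only A and B are trained,
        # so each user's adapter is a few MB rather than a full model copy.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params per adapted layer: {trainable}")  # 8*512*2 = 8192
```

Whether gradient updates on this kind of adapter actually capture the implicit, subconscious stuff is exactly the open question.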
The point is that we need to scale up memory features to something way more computationally intensive.
Yes. Maybe it will turn out to be a simple compression problem…
Some thoughts:
- A one-to-one human replacement would probably want a ~3B-token context window as an upper bound (humans have roughly 2-3B heartbeats over an average lifespan)
- Of course in practice you don't need this much; ~2M-20M of context is probably sufficient for most tasks. The question is whether the transformer architecture can scale that far in practice, given O(n^2) attention (see the quick arithmetic below)
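Quick back-of-the-envelope on that O(n^2) point, assuming we naively materialize one fp16 attention score matrix per head per layer. Real systems avoid this with FlashAttention-style kernels, so treat these purely as rough orders of magnitude, not actual hardware requirements.

```python
# Memory for a single n x n fp16 attention score matrix (one head, one layer),
# if materialized naively, at a few of the context lengths discussed above.
for n in [128_000, 2_000_000, 20_000_000, 3_000_000_000]:
    bytes_per_head = 2 * n * n  # 2 bytes per fp16 score
    print(f"{n:>13,} tokens -> {bytes_per_head / 1e9:,.0f} GB per head per layer")
```

Even at 2M tokens that naive matrix is on the order of terabytes per head per layer, which is why "just make the context window bigger" runs into a wall without algorithmic changes.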
I *think* what Dwarkesh is saying is that it's very hard to get human-like *output* from LLMs, not that they necessarily need to learn in exactly the same way humans do.
Even with clever context engineering, it's the spirit of continual learning, but not what our brains do: converting experiences into memories and working them into our actions to keep making us better.
I still disagree with both of these.
LLMs often produce human-quality outputs, and the probability of that is going up.
On the second point, my piece is saying that's irrelevant. The "spirit" part doesn't matter; it's about abilities.
But the problem is it gets to a point, and no higher. That point, and how low it is for Dwarkesh (and myself), is why his AGI timelines are increasing.
I think only the "spirit" of continual learning exists in LLMs today. Anecdotally, I see a significant improvement from Claude when I give it rules and really well-defined context for my codebase, but after that, performance stays the same. Even with continuous context additions and new rules, it barely moves the needle.
Whereas a human on the job would continually learn and understand deeply the nooks and crannies of the codebase and not repeat the same mistakes over and over.
I think LMs will be able to do this quite soon:
> Whereas a human on the job would continually learn and understand deeply the nooks and crannies of the codebase and not repeat the same mistakes over and over.
Right now there are just bad systems around LMs. Like, the gap from Claude Code to Cursor to Copilot is all system differences. We're very early in hill-climbing that.
> But the problem is it gets to a point, and no higher.
Confused by this. What's the point?
I think it's less about how we personally use LMs and more about the serious, complex engineering systems on the back end that companies will build for us.
The "point" is capability given context. I don't believe the systems behind Claude Code are bad; they do their job of getting the context it needs, but it's not *learning* (and I can tell via the change in output quality over time).
I'm curious why you think the systems (let's restrict to Claude Code) around LMs are bad?
I think Claude Code is great, but it can be made much better with more parallel compute (as some people do) and slightly more expensive models (forcing Opus), and the analog doesn't exist outside of code, e.g. for writing and ideas. Is that clearer?
Makes sense. Are you saying that when we build Claude Code-like systems around everyday tasks like writing or doing taxes, we will essentially get what Dwarkesh calls continual learning?