a16z: Can continual learning cure AI's "amnesia"?

PANews · 3 hours ago

Original authors: Malika Aubakirova, Matt Bornstein, a16z crypto

Translated by: Deep Tide TechFlow

In Christopher Nolan's Memento, the protagonist Leonard Shelby lives in a fragmented present. Brain damage has left him with anterograde amnesia, unable to form new memories. Every few minutes his world resets, leaving him trapped in an eternal "now," unable to remember what just happened or to know what comes next. To survive, he tattoos notes on his body and carries Polaroids, relying on these external tools to stand in for the memory his brain can no longer provide.

Large language models also exist in a similar eternal present. After training, vast amounts of knowledge are frozen in parameters, and the model cannot form new memories or update its parameters based on new experiences. To compensate for this shortcoming, we build scaffolding around it: chat history serves as short-term notes, a retrieval system acts as an external notebook, and system prompts are like tattoos on the body. But the model itself has never truly internalized this new information.

More and more researchers believe this is not enough. The problems in-context learning (ICL) can solve rest on the premise that the answers (or fragments of them) already exist somewhere in the world. But for problems that require genuine discovery (such as brand-new mathematical proofs), adversarial settings (such as security offense and defense), or knowledge too implicit to be expressed in language, there is ample reason to believe the model needs a way to write new knowledge and experience directly into its parameters after deployment.

In-context learning is temporary. True learning requires compression. Until we allow the model to continuously compress, we may be trapped in the eternal present of Memento. Conversely, if we can train the model to learn its own memory architecture instead of relying on external custom tools, we might unlock an entirely new dimension of scaling.

This research area is called continual learning. This concept is not new (see McCloskey and Cohen's 1989 paper), but we believe it is one of the most important research directions in the current AI field. The explosive growth of model capabilities over the past two to three years has made the gap between what models "know" and what they "can know" increasingly evident. The purpose of this article is to share what we have learned from leading researchers in this field, help clarify the different paths of continual learning, and promote the development of this topic in the entrepreneurial ecosystem.

Note: The formation of this article owes much to deep interactions with a group of outstanding researchers, PhD students, and entrepreneurs who generously shared their work and insights in the field of continual learning with us. From theoretical foundations to the engineering realities of learning after deployment, their insights have made this article much more substantial than if we had written it alone. Thank you for your time and ideas!

Let's talk about context first

Before defending parameter-level learning (i.e., learning that updates model weights), it is necessary to acknowledge a fact: in-context learning does indeed work. Furthermore, there is a strong argument that it will continue to succeed.

The essence of the Transformer is a sequence-based conditional next-token predictor. Given the right sequence, you can elicit surprisingly rich behaviors without touching the weights at all. This is why context management, prompt engineering, instruction fine-tuning, and few-shot examples are so powerful. Intelligence is encapsulated in static parameters, while the exhibited capabilities change drastically with the content you feed into the window.
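
To make the point concrete, here is a minimal sketch using the Hugging Face transformers library ("gpt2" is only a small stand-in; the behaviors described in this article assume far more capable models). The weights are loaded once and never updated; everything interesting comes from what we place in the window:

```python
# A minimal sketch of in-context learning with frozen weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # the parameters are never updated below

# The same frozen weights behave differently depending solely on the
# sequence they are conditioned on: here, a few-shot translation prompt.
prompt = (
    "Translate English to French.\n"
    "sea otter -> loutre de mer\n"
    "cheese -> "
)
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))
```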

Cursor's recent deep dive on scaling autonomous programming agents is a good example: the model weights are fixed, and what actually makes the system work is the careful orchestration of context: what to put in, when to summarize, and how to maintain coherence over hours of autonomous operation.

OpenClaw is another good example. It became popular not because of privileged model access (the underlying models are available to everyone), but because it translates context and tools into working state extremely efficiently: tracking what you are doing, structuring intermediate outputs, deciding when to re-inject prompts, and maintaining a durable memory of prior work. OpenClaw has elevated the "shell design" of agents into a discipline of its own.

When prompt engineering first emerged, many researchers were skeptical that "just prompting" could become a serious interface. It looked like a hack. But it is native to the Transformer architecture: it requires no retraining and upgrades automatically as models improve. As the model grows stronger, the prompts grow stronger with it. "Rudimentary but native" interfaces often win because they couple directly with the underlying system rather than working against it. So far, the development trajectory of LLMs has borne this out.

State Space Models: Context on Steroids

As mainstream workflows shift from raw LLM calls to agent loops, the pressure on in-context learning is mounting. In the past, it was relatively rare for the context window to fill completely; that happened mainly when an LLM was asked to complete a long list of discrete tasks, and the application layer could trim and compress chat histories in a fairly straightforward way.

For agents, however, a single task may consume a large share of the total available context. Every step of the agent loop relies on context carried forward from previous iterations. Agents often fail after 20 to 100 steps because they "lose the thread": the context fills up, coherence degrades, and the task stops converging.

As a result, major AI labs are now investing significant resources (i.e., large-scale training runs) to develop models with ultra-long context windows. This is a natural path: it builds on a method that already works (in-context learning) and aligns with the industry's broader shift of computation toward inference time. The most common architecture intersperses fixed-size memory layers, namely state space models (SSMs) and linear attention variants (collectively referred to as SSMs hereafter), among standard attention layers. SSMs offer fundamentally better scaling curves in long-context scenarios.

Figure caption: Comparison of SSM with traditional attention mechanisms in terms of scaling
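
To see why the scaling curves differ, consider the linear recurrence at the heart of an SSM. The sketch below is illustrative only (random matrices, not any published parameterization), but it shows the key property: the state, and thus per-token cost, stays constant no matter how long the sequence grows, whereas attention's KV cache grows with every token:

```python
import numpy as np

# A minimal sketch of the SSM recurrence: per-token compute and memory
# are O(1) in sequence length, versus the O(T)-growing KV cache of
# standard attention. A, B, C here are random placeholders; real SSMs
# (e.g. Mamba, linear attention variants) parameterize and train them.
d_state, d_in = 16, 8
A = 0.9 * np.eye(d_state)                 # state transition (decaying memory)
B = 0.1 * np.random.randn(d_state, d_in)
C = 0.1 * np.random.randn(d_in, d_state)

h = np.zeros(d_state)                     # fixed-size state, however long T is
for x_t in np.random.randn(10_000, d_in): # an arbitrarily long sequence
    h = A @ h + B @ x_t                   # compress the entire past into h
    y_t = C @ h                           # read out from the compressed state
```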

The goal is to increase the number of coherent steps an agent can run by several orders of magnitude, from about 20 to roughly 20,000, without losing the broad skills and knowledge of traditional Transformers. If successful, this would be a major breakthrough for long-running agents.

You might even view this approach as a form of continual learning: although the model weights are not updated, it introduces a memory layer that almost never needs to be reset.

Thus, these non-parameterized methods are real and powerful. Any assessment of continual learning must start from here. The issue is not whether today's context systems are useful; they certainly are. The question is whether we have already hit their ceiling, and whether new methods can take us further.

What Context Misses: The "Filing Cabinet Fallacy"

"What happens with AGI and pre-training is that, in a sense, they over-tune... Humans are not AGI. Yes, humans do have a skill base, but they lack a large body of knowledge. What we rely on is continual learning.

"If I create a super-smart 15-year-old, they know nothing. A good student, very eager to learn. You could say: go be a programmer, go be a doctor. Deployment itself will involve some learning, trial and error. It's a process, not just throwing a finished product out there." —Ilya Sutskever

Imagine an infinitely capacious system: the world's largest filing cabinet, where every fact is perfectly indexed and instantly retrievable. Can it find anything? Yes. Has it learned anything?

No. It has never been compelled to compress.

This is the core of our argument, echoing a point Ilya Sutskever has made before: LLMs are essentially compression algorithms. During training, they compress the internet into parameters. Compression is lossy, and it is precisely this lossiness that makes them powerful. Compression forces the model to look for structure, generalize, and build representations that transfer across contexts. A model that rote-memorizes all training samples is inferior to one that extracts the underlying patterns. Lossy compression is learning.

Ironically, the very mechanism that makes LLMs so powerful during training (compressing raw data into compact, transferable representations) is exactly what we refuse to let them keep doing after deployment. The moment a model ships, we stop allowing compression and substitute external memory.

Of course, most agent shells will compress context in some customized manner. But isn’t the bitter lesson telling us that the model itself should learn this compression, directly and at scale?

Yu Sun shared an example to illustrate this debate: mathematics. Take Fermat's Last Theorem. For over 350 years, no mathematician could prove it, not due to a lack of relevant literature, but because the solution was highly novel. The conceptual distance between existing mathematical knowledge and the eventual answer was too great.

When Andrew Wiles finally solved it in the 1990s, he worked nearly in isolation for seven years and had to invent entirely new techniques to reach the answer. His proof relied on successfully bridging two different branches of mathematics: elliptic curves and modular forms. Although Ken Ribet had previously proved that establishing this connection would automatically resolve Fermat's Last Theorem, before Wiles, no one possessed the theoretical tools capable of actually constructing this bridge. A similar argument can be made for Grigori Perelman’s proof of the Poincaré conjecture.

The core question is: do these examples prove that LLMs lack something—an ability to update priors and engage in truly creative thinking? Or does this story happen to support the opposite conclusion—that all human knowledge is simply data for training and reorganization, and Wiles and Perelman merely demonstrated what LLMs could also achieve on a larger scale?

This question is empirical, and the answer remains uncertain. But we do know that there are many categories of problems for which in-context learning will fail today, while parameter-level learning may prove useful. For example:

Figure caption: Categories of problems where in-context learning fails and parameter learning may win

More importantly, in-context learning can only handle what can be expressed in language, while weights can encode concepts that prompts cannot put into words. Some patterns are too high-dimensional, too implicit, or too structurally deep to fit into context. The visual texture that distinguishes benign artifacts from tumors in medical scans, or the subtle audio variations that define a speaker's unique rhythm, are patterns that resist decomposition into precise vocabulary.

Language can only approximate them. No matter how long the prompts, they cannot convey these things; this type of knowledge can only exist in the weights. It lives in the latent space of learned representations, not in words. Regardless of how large the context window grows, there will always be some knowledge that cannot be described textually and can only be carried by parameters.

This may explain why explicit "remembers you" features (such as ChatGPT's memory) often leave users uneasy rather than impressed. What users truly want is not recall but capability. A model that has internalized your behavior patterns can generalize to new contexts; a model that merely replays your history cannot. The gap between "this is what you wrote the last time you replied to this email" (verbatim repetition) and "I understand your thinking style well enough to anticipate what you need" is the gap between retrieval and learning.

Introduction to Continual Learning

There are various paths to continual learning. The dividing line is not whether there is a memory function, but where compression happens. The paths sit along a spectrum, from no compression (pure retrieval, frozen weights) to full internal compression (weight-level learning that makes the model itself smarter), with an important intermediate zone (modules).

Figure caption: Three paths of continual learning—context, modules, weights

Context

On the context end, teams build smarter retrieval pipelines, agent shells, and prompt orchestration. This is the most mature category: the infrastructure is validated and the deployment paths are clear. The limitation is depth: everything remains bounded by context length.

One notable new direction is multi-agent architectures as a scaling strategy for context itself. If a single model is limited to a 128K-token window, a coordinated group of agents (each holding its own context, focused on a slice of the problem, communicating results to the others) can collectively approximate unbounded working memory. Each agent performs in-context learning within its own window; the system aggregates. Cursor's recent autoresearch project and its browser-building examples are early cases. This is a purely non-parameterized method (no weight modifications), but it significantly raises the ceiling of what context systems can achieve.

Modules

In the module space, teams build pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) that let general models specialize without retraining the core weights. A model with 8B parameters plus the right modules can match the performance of a 109B model on target tasks, at a fraction of the memory. The appeal lies in compatibility with existing Transformer infrastructure.
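
As a sketch of what such a module can look like, here is a LoRA-style low-rank adapter in PyTorch. The rank and scaling values are illustrative defaults, not recommendations; the point is the pattern: the base weights stay frozen, and all specialization lives in a small, independently swappable set of parameters.

```python
import torch
import torch.nn as nn

# A minimal sketch of one "pluggable module" pattern: a low-rank adapter
# beside a frozen base projection. Only A and B train.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # core weights stay frozen
        self.A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus a low-rank correction: Wx + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```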

Weights

On the weight-update end, researchers are pursuing true parameter-level learning: updating sparse memory layers that touch only the relevant slices of parameters, optimizing the model through reinforcement-learning loops driven by feedback, and compressing context into weights at inference time (test-time training). These are the deepest methods and the hardest to deploy, but they genuinely let the model internalize new information or new skills.

There are various specific mechanisms for parameter updates. Here are several research directions:

Figure caption: Overview of research directions in weight-level learning

Weight-level research encompasses multiple parallel routes. Regularization and weight space methods have the longest history: EWC (Kirkpatrick et al., 2017) penalizes parameter changes based on the importance of parameters to prior tasks; weight interpolation (Kozal et al., 2024) mixes new and old weight configurations in parameter space, but both tend to be fragile at scale.
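
As a concrete reference point, the EWC objective fits in a few lines. The sketch below assumes the caller supplies a snapshot of the old parameters, a per-parameter importance estimate (typically the diagonal Fisher information), and a penalty strength lambda:

```python
import torch

def ewc_loss(model, new_task_loss, old_params, fisher, lam=0.4):
    # Quadratic penalty anchoring each parameter to its value after the
    # previous task, weighted by its estimated importance (Fisher diagonal).
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return new_task_loss + (lam / 2.0) * penalty
```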

Test-time Training was pioneered by Sun et al. (2020) and later evolved into architectural primitives (TTT layers, TTT-E2E, TTT-Discover), with a fundamentally different approach: performing gradient descent on test data and compressing new information into parameters at the moment it is needed.
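
A minimal sketch of that loop in PyTorch; `self_supervised_loss` is a placeholder for whatever auxiliary objective a given TTT variant uses (for example, next-token prediction on the test sequence itself):

```python
import torch

def adapt_and_predict(model, test_input, self_supervised_loss, steps=3, lr=1e-4):
    # Take a few gradient steps on a loss computed from the test input itself,
    # compressing its information into the weights right when it is needed.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = self_supervised_loss(model, test_input)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(test_input)  # predict with the freshly adapted weights
```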

Meta-learning asks: can we train models that know how to learn? Approaches range from MAML's sample-efficient parameter initialization (Finn et al., 2017) to Behrouz et al.'s Nested Learning (2025), which structures the model as a hierarchy of optimization problems, running fast adaptation and slow updates at different time scales, inspired by biological memory consolidation.
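
For reference, the MAML structure is a two-level loop: the inner loop adapts a copy of the parameters to one task, and the outer loop updates the shared initialization so that inner adaptation works well. The sketch below uses explicit functional parameters for clarity; `train_loss` and `eval_loss` are hypothetical per-task callables, not any library's API:

```python
import torch

def inner_adapt(params, train_loss, inner_lr=0.01):
    # One adaptation step on a task; create_graph=True lets the outer loop
    # differentiate through this update.
    grads = torch.autograd.grad(train_loss(params), params, create_graph=True)
    return [p - inner_lr * g for p, g in zip(params, grads)]

def maml_outer_step(params, tasks, outer_lr=1e-3):
    # Meta-loss: how well each task does *after* inner adaptation.
    meta_loss = sum(
        task.eval_loss(inner_adapt(params, task.train_loss)) for task in tasks
    )
    grads = torch.autograd.grad(meta_loss, params)
    return [(p - outer_lr * g).detach().requires_grad_() for p, g in zip(params, grads)]
```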

Distillation retains knowledge of prior tasks by having student models match frozen teacher checkpoints. LoRD (Liu et al., 2025) makes distillation efficient enough to run continuously by pruning the model and using replay buffers at the same time. Self-distillation (SDFT, Shenfeld et al., 2026) flips the source, using the model's own outputs conditioned on expert context as the training signal, sidestepping the catastrophic forgetting of sequential fine-tuning.
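
The basic operation underneath these methods is a soft-target matching loss; a minimal sketch (the temperature value is illustrative):

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between the teacher's softened distribution and the
    # student's; scaling by T^2 keeps gradient magnitudes comparable.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```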

Recursive Self-Improvement operates on a similar idea: STaR (Zelikman et al., 2022) bootstraps reasoning ability from self-generated reasoning chains; AlphaEvolve (DeepMind, 2025) discovers optimizations to algorithms that had not been improved in decades; Silver and Sutton's "Era of Experience" (2025) frames agent learning as a never-ending stream of continual experience.

These research directions are converging. TTT-Discover already combines test-time training with RL-driven exploration. HOPE nests fast and slow learning loops within a single architecture. SDFT turns distillation into a basic operation of self-improvement. The boundaries between these directions are blurring. The next generation of continual learning systems will likely combine multiple strategies: regularization for stability, meta-learning for speed, and self-improvement for compounding gains. A growing number of startups are betting on different layers of this stack.

The Landscape of Continual Learning Startups

The non-parameterized end of the spectrum is the best known. Agent-shell companies (Letta, mem0, Subconscious) build orchestration layers and scaffolds that manage what gets fed into context windows. External storage and RAG infrastructure (e.g., Pinecone, xmemory) provide the retrieval backbone. The data exists; the challenge is putting the right slice in front of the model at the right time. As context windows expand, the design space for these companies grows too, especially on the shell end, where a new wave of startups is emerging to manage increasingly complex context strategies.

The parameter end is earlier-stage and more diverse. Companies here are experimenting with some version of "post-deployment compression," letting models internalize new information in their weights. The approaches can be broadly framed as several different bets on how models should learn after release.

Partial compression: learning without retraining. Some teams are building pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) that allow general models to specialize without touching core weights. The common argument: you can achieve meaningful compression (not just retrieval) while keeping the stability-plasticity trade-off manageable, because learning is isolated rather than diffused across the entire parameter space. An 8B model equipped with suitable modules can match much larger models on target tasks. The benefits include composability: modules plug into the existing Transformer architecture, can be swapped or updated independently, and cost far less to experiment with than retraining.

RL and feedback loops: learning from signals. Other teams are betting that the richest signals for post-deployment learning already exist in the deployment loop itself—user corrections, task successes or failures, and reward signals from real-world outcomes. The core idea is that the model should treat every interaction as a potential training signal, not just an inference request. This is highly similar to how humans improve at work: doing work, getting feedback, internalizing which methods work. The engineering challenge lies in converting sparse, noisy, sometimes adversarial feedback into stable weight updates while avoiding catastrophic forgetting. But a model that can genuinely learn from deployment will generate compounding value in ways that the context system cannot.
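
As a hypothetical sketch of that loop (the names and the bare REINFORCE-style rule below are ours for illustration, not any company's method; a production system would add KL regularization, replay, and gating to guard against catastrophic forgetting):

```python
import torch

def feedback_update(model, opt, interaction_log_prob, reward, baseline=0.0):
    # Treat one deployed interaction as a training signal: a scalar reward
    # (user correction, task success) weights a likelihood gradient.
    loss = -(reward - baseline) * interaction_log_prob
    opt.zero_grad()
    loss.backward()
    opt.step()
```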

Data-centric: learning from the right signals. A related but distinct bet is that the bottleneck is not the learning algorithm but the training data and the systems around it. These teams focus on filtering, generating, or synthesizing the right data to drive continual updates; the assumption is that a model given high-quality, well-structured learning signals needs far fewer gradient steps to improve meaningfully. This naturally connects with the feedback-loop companies but emphasizes upstream questions: whether a model can learn is one thing; what it should learn from, and how much, is another.

New architectures: learning built in from the ground up. The most radical bet holds that the Transformer architecture itself is the bottleneck and that continual learning requires fundamentally different computational primitives: architectures with continuous-time dynamics and built-in memory mechanisms. The argument is structural: if you want a continual learning system, the learning mechanism should be embedded in the substrate itself.

Figure caption: The landscape of continual learning startups

All the major labs are also active across these categories. Some are exploring better context management and reasoning chains, others are experimenting with external memory modules or sleep-time computation pipelines, and several stealth companies are pursuing new architectures. The field is early enough that no single method has emerged as the winner, and given the diversity of use cases, there may never be just one.

Why Naive Weight Updates Fail

Updating model parameters in production triggers a series of failure modes that remain unresolved at scale.

Figure caption: Failure modes of naive weight updates

The engineering problems are well documented. Catastrophic forgetting means that a model plastic enough to learn from new data will overwrite existing representations (the stability-plasticity dilemma). Temporal entanglement means that invariant rules and mutable state are compressed into the same set of weights, so updating one corrupts the other. Logical integration fails because factual updates do not propagate to their implications: the change lands at the level of token sequences, not semantic concepts. Unlearning remains impossible: there is no differentiable subtraction operation, so false or toxic knowledge cannot be surgically excised.

There is also a second class of problems that has received less attention. The separation of training from deployment is not merely an engineering convenience; it marks the boundary of safety, auditability, and governance. Opening that boundary raises several issues at once. Safety alignment may degrade unpredictably: even narrow fine-tuning on benign data can produce broad misbehavior.

Continuous updates create a data-poisoning attack surface: a slow, persistent version of prompt injection that lives in the weights. Auditability collapses: a continuously updating model is a moving target that defeats version control, regression testing, and one-time certification. When user interactions are compressed into parameters, privacy risks escalate, because sensitive information gets baked into representations and is harder to filter out than information retrieved from context.

These are open questions, not fundamentally impossible ones. Solving them is part of the research agenda for continual learning, just as addressing core architectural challenges is.

From "Memento" to True Memory

Leonard's tragedy in Memento is not that he cannot function; in any given scene he is clever, even remarkable. His tragedy is that nothing ever compounds. Every experience stays external: a Polaroid, a tattoo, a note in someone else's handwriting. He can retrieve, but he cannot compress new knowledge.

As Leonard navigates this self-constructed labyrinth, the boundary between reality and belief begins to blur. His condition does not just rob him of memory; it forces him to constantly reconstruct meaning, making him both a detective and an unreliable narrator in his own story.

Today's AI operates under the same constraints. We have built very powerful retrieval systems: longer context windows, smarter shells, coordinated multi-agent groups, and they work. But retrieval is not the same as learning. A system that can look up any fact has not been compelled to seek structure. It has not been compelled to generalize. The mechanism that makes training so powerful—transforming raw data into transferable representations—is exactly what we turned off at the moment of deployment.

The path forward is likely not a single breakthrough but a layered system. In-context learning will remain the first layer of adaptation: it is native, validated, and continuously improving. Module mechanisms can handle the middle ground of personalization and domain specialization.

But for those truly challenging problems—discovery, adversarial adaptation, and latent knowledge that cannot be expressed in words—we may need to allow models to continue compressing experiences into parameters after training. This means advances in sparse architectures, meta-learning objectives, and self-improvement loops. It might also require us to redefine what "models" mean: not a set of fixed weights, but an evolving system with its memories, its update algorithms, and its ability to abstract from its experiences.

The filing cabinet keeps getting bigger. But no matter how big it gets, it is still just a filing cabinet. The breakthrough lies in letting models do after deployment what made them powerful during training: compress, abstract, learn. If we succeed, we stand at the turning point from forgetful models to models that carry a glimmer of experiential knowledge. If not, we will stay trapped in our own Memento.

