Memory in the Age of AI Agents: A Survey
Forms, Functions and Dynamics
ArXiv: 2512.13564
🎯 Pitch
This survey introduces a unified taxonomy for agent memory along three axes: forms (token-level, parametric, latent), functions (factual, experiential, working), and dynamics (formation, evolution, retrieval). The taxonomy clarifies inconsistent terminology and links architectures to evaluation practices. By reframing memory as a first-class, operational substrate for long-horizon reasoning, continual adaptation, and multi-agent/multimodal interaction, it gives researchers and practitioners a coherent foundation to design, benchmark, and advance memory-enabled AI agents with real-world longevity and reliability.
1. Executive Summary
This survey builds a unified, up-to-date taxonomy for "memory" in foundation-model-based AI agents by organizing the space along three axes (forms, functions, and dynamics) to resolve fragmented terminology and mismatched evaluation practices. It matters because the survey positions memory as the enabling substrate for long-horizon reasoning, continual adaptation, and coherent interaction across sessions, and argues that existing "short/long-term memory" labels are too coarse for modern agent systems.
2. Context and Motivation
- What specific problem/gap is addressed?
  - The survey targets conceptual fragmentation in "agent memory" research: works differ in motivation, implementation, assumptions, and evaluation protocols, while memory terms (e.g., episodic/semantic/parametric) are used inconsistently.
  - It argues traditional taxonomies (especially "short-term vs long-term memory") do not capture the diversity and dynamics of memory mechanisms now used in LLM-based agents.
- Why is this problem important?
  - Memory is framed as a core capability for agents that must act over time, interact with environments, and adapt (Introduction).
  - The survey highlights application areas that depend on persistent memory-like behavior (e.g., personalized chatbots, recommender systems, social simulations, financial investigations; Introduction), and also positions memory as foundational for continual evolution through environment interaction.
- What prior approaches existed, and where do they fall short?
  - Prior surveys exist, but the paper claims their taxonomies were developed before rapid advances (notably in 2025), leaving emerging directions underrepresented (Section "Agent Memory Needs A New Taxonomy").
  - A key shortcoming is over-reliance on temporal labels (short/long-term) rather than distinguishing what is stored, what it is used for, and how it evolves.
- How does this survey position itself relative to existing work?
  - It explicitly proposes a three-lens framework:
    - `Forms` = where/how memory is represented (Section 3; Figure 1).
    - `Functions` = why memory is needed (Section 4; Figure 6).
    - `Dynamics` = how memory is formed/evolved/retrieved (Section 5; Figure 8).
  - It also delineates scope by distinguishing agent memory from adjacent concepts like `LLM memory`, `RAG`, and `context engineering` (Section 2.3; Figure 2).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The "system" here is a survey framework: a structured way to classify and reason about memory mechanisms used in LLM-based agents.
- It solves the "shape" of the problem by giving (i) formal definitions, (ii) a taxonomy across representation and purpose, and (iii) a lifecycle model (formation → evolution → retrieval) for how memory operates over time.
3.2 Big-picture architecture (diagram in words)
- Box 1: Formal agent + environment loop (Section 2.1): defines state `s_t`, observations `o_t^i`, actions `a_t`, and trajectories `τ`.
- Box 2: Memory state `M_t` (Section 2.2): a generic container (text buffer / KV store / vector DB / graph / hybrid).
- Box 3: Memory operators (Section 2.2):
  - `F`: formation (extract/store useful artifacts),
  - `E`: evolution (consolidate/update/forget/restructure),
  - `R`: retrieval (produce memory signal `m_t^i` for the agent).
- Box 4: Taxonomies layered on top (Sections 3–5):
  - `Forms` (token-level / parametric / latent),
  - `Functions` (factual / experiential / working),
  - `Dynamics` (formation / evolution / retrieval pipelines).
3.3 Roadmap for the deep dive
- I first explain the paper's formal definitions for agents and memory (`τ`, `M_t`, `F/E/R`) because later taxonomies reuse these concepts (Section 2).
- Next I describe scope boundaries vs. `LLM memory`, `RAG`, and `context engineering` to avoid conflating related work (Section 2.3; Figure 2).
- Then I unpack the Forms taxonomy (Section 3; Figures 3–5), since it answers "what carries memory?"
- After that, I cover Functions (Section 4; Figures 6–7), which answers "why does an agent need memory?"
- Finally, I walk through Dynamics (Section 5; Figures 8–10; Table 7) as the operational lifecycle tying everything together.
3.4 Detailed, sentence-based technical breakdown
This is a survey/taxonomy paper with a formalization component; its core idea is that "memory for agents" should be understood through (i) representation forms, (ii) behavioral functions, and (iii) lifecycle dynamics, rather than only short/long-term time horizons (Figure 1; Sections 3–5).
3.4.1 Formalizing LLM-based agent systems (Section 2.1)
- The paper models an agent system (single- or multi-agent) as interaction with an environment state space `S`, where the environment transitions via a controlled stochastic dynamics model:
  - `s_{t+1} ~ Ψ(s_{t+1} | s_t, a_t)`.
- Each agent `i` receives an observation:
  - `o_t^i = O_i(s_t, h_t^i, Q)`,
  - where `h_t^i` is the visible history portion and `Q` is a task specification (instruction/goal/constraints).
- Actions are explicitly heterogeneous (not just text), including natural language, tool invocation, planning outputs, environment control, and communication actions (Section 2.1).
- The policy is written as:
  - `a_t = π_i(o_t^i, m_t^i, Q)`,
  - where `m_t^i` is a memory-derived signal defined next.
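The loop defined above can be made concrete with a toy sketch. Nothing here comes from the survey: `CounterEnv`, `observe`, `policy`, and `run_episode` are hypothetical names, the environment is a deliberately trivial counter, and the memory signal is a fixed string standing in for `m_t^i` (memory operators are introduced in the next subsection).

```python
import random
from dataclasses import dataclass, field


@dataclass
class CounterEnv:
    """Toy environment (assumption): the state is an integer counter."""
    state: int = 0
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def step(self, action: int) -> None:
        # s_{t+1} ~ Psi(. | s_t, a_t): the action takes effect with probability 0.9
        if self.rng.random() < 0.9:
            self.state += action


def observe(state: int, history: list, task: str) -> dict:
    # o_t^i = O_i(s_t, h_t^i, Q): expose the state and the visible part of the history
    return {"state": state, "last_actions": history[-3:], "task": task}


def policy(obs: dict, memory_signal: str, task: str) -> int:
    # a_t = pi_i(o_t^i, m_t^i, Q): move toward the target carried by the memory signal
    target = int(memory_signal)
    return 1 if obs["state"] < target else 0


def run_episode(steps: int = 20, target: int = 5) -> list:
    """Roll out the loop and return the trajectory as (state, action) pairs."""
    env, history, trajectory = CounterEnv(), [], []
    task = "reach the target"
    for _ in range(steps):
        obs = observe(env.state, history, task)
        action = policy(obs, memory_signal=str(target), task=task)
        env.step(action)
        history.append(action)
        trajectory.append((obs["state"], action))
    return trajectory
```

Calling `run_episode()` returns one trajectory `τ`; the point is only to show where `Q`, `h_t^i`, and `m_t^i` enter the loop, not to model a realistic agent.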
3.4.2 Formalizing "agent memory systems" via a unified memory state and operators (Section 2.2)
- Memory is represented as an evolving state:
  - `M_t ∈ M`,
  - where `M` is the set of admissible memory configurations and no specific structure is assumed (it could be text, vectors, a graph, etc.).
- Memory is integrated into the agent loop through three conceptual operators:
  - Formation: the agent produces artifacts `φ_t` (tool outputs, reasoning traces, plans, self-evals, environment feedback), and the formation operator selectively transforms them:
    - `M_{t+1}^{form} = F(M_t, φ_t)`.
  - Evolution: formed candidates are integrated/cleaned/restructured:
    - `M_{t+1} = E(M_{t+1}^{form})`,
    - with examples including consolidating redundancy, resolving conflicts, discarding low-utility items, or restructuring for retrieval (Section 2.2).
  - Retrieval: at decision time, the agent retrieves a context-dependent signal:
    - `m_t^i = R(M_t, o_t^i, Q)`,
    - and `m_t^i` is formatted for LLM consumption (snippets/summary/structured content).
- A key modeling choice is that short-term vs long-term effects are not forced into separate modules: they can "emerge" from invocation patterns of `F/E/R` (e.g., retrieve only at `t=0`, or continuously), rather than architectural separation (Section 2.2).
Worked micro-example (illustrative, based on the paper's operators)
- Suppose an agent is doing a multi-step web task.
- At step `t`, it calls a tool and receives output plus feedback; these are part of `φ_t`.
- `F(M_t, φ_t)` might extract a concise "lesson" (e.g., a successful navigation pattern) rather than storing the entire raw log.
- `E(...)` might merge that lesson into an existing set of similar lessons and remove duplicates.
- Later, when asked a related question, `R(M_t, o_t^i, Q)` retrieves the most relevant lesson(s) as `m_t^i` to guide the next action.
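The micro-example can be sketched as three small functions. This is an illustration under stated assumptions, not the survey's implementation: `Lesson`, `form`, `evolve`, and `retrieve` are hypothetical names, memory is a plain list, evolution only removes exact duplicates, and retrieval ranks by keyword overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lesson:
    text: str
    keywords: frozenset


def form(memory: list, artifacts: dict) -> list:
    """F: keep a compact lesson only when the step succeeded; the raw log is dropped."""
    if not artifacts.get("success"):
        return list(memory)
    words = frozenset(artifacts["lesson"].lower().split())
    return list(memory) + [Lesson(artifacts["lesson"], words)]


def evolve(memory: list) -> list:
    """E: remove exact duplicates while preserving insertion order."""
    seen, merged = set(), []
    for lesson in memory:
        if lesson.text not in seen:
            seen.add(lesson.text)
            merged.append(lesson)
    return merged


def retrieve(memory: list, observation: str, task: str, k: int = 1) -> list:
    """R: rank lessons by keyword overlap with the observation and the task."""
    query = set((observation + " " + task).lower().split())
    scored = [(len(lesson.keywords & query), lesson.text) for lesson in memory]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [text for _, text in ranked[:k]]


# Three steps of a web task: two identical successes and one failure.
memory = []
for artifacts in [
    {"success": True, "lesson": "use the site search box before browsing menus"},
    {"success": False, "lesson": "ignored"},
    {"success": True, "lesson": "use the site search box before browsing menus"},
]:
    memory = evolve(form(memory, artifacts))

signal = retrieve(memory, "page shows a search box", "find the pricing page")
```

After the loop, `memory` holds a single lesson and `signal` contains it. Calling `retrieve` once at `t=0` or at every step is the kind of invocation pattern the survey says can make short-term or long-term behavior emerge without separate modules.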
3.4.3 Scope boundaries: agent memory vs. neighboring concepts (Section 2.3; Figure 2)
The survey argues many confusions come from overlapping implementations but different goals:
- Agent memory vs `LLM memory` (Section 2.3.1)
  - The paper says "agent memory" largely subsumes what used to be called "LLM memory" in 2023–2024 (e.g., early systems framed as giving LLMs memory but functionally enabling agent-like persistence across interactions).
  - It draws a boundary: work focused on internal model mechanisms (e.g., KV cache management, long-context architectures, attention sparsity) is categorized as `LLM memory` and often outside the scope of `agent memory` because it does not necessarily provide cross-task persistence or deliberate `F/E/R` operations.
- Agent memory vs `RAG` (Section 2.3.2)
  - Both use retrieval stacks (vector indices, semantic search, graphs), but classical `RAG` is framed as augmenting an LLM with static external knowledge for single tasks, while agent memory is framed as a persistent, self-evolving internal memory base built from the agent's own interactions and feedback.
  - The paper notes the boundary is increasingly blurred (e.g., dynamic retrieval, systems interpreted as memory), so it also distinguishes them pragmatically by evaluation domains: `RAG` often appears in multi-hop QA benchmarks (the survey lists examples), while agent memory is often tested in sustained interactive settings and agent benchmarks (Section 2.3.2).
- Agent memory vs `context engineering` (Section 2.3.3)
  - `Context engineering` is treated as optimizing the context window as a resource, whereas agent memory is treated as maintaining a persistent cognitive state (what the agent knows/has experienced) across tasks.
  - The overlap is especially strong in `working memory` and long-horizon interaction, where both rely on compression, organization, and selection (Section 2.3.3).
3.4.4 Forms: what carries memory? (Section 3; Figures 3–5)
The survey's first major taxonomy is representational:
- `Token-level memory` (Section 3.1)
  - Defined as storing memory in explicit, discrete, externally accessible units ("tokens" broadly includes text and other discrete modality units).
  - The key sub-taxonomy is by topological organization (Figure 3):
    - `Flat (1D)`: no explicit inter-unit structure; a list/bag of chunks/dialogue/experiences/summaries (Section 3.1.1; Table 1).
    - `Planar (2D)`: explicit single-layer structure like trees/graphs/tables but without multi-layer hierarchy (Section 3.1.2; Table 1).
    - `Hierarchical (3D)`: multi-layer structures enabling coarse-to-fine abstraction and cross-layer reasoning (Section 3.1.3; Table 1).
  - The survey repeatedly stresses the trade-off: token-level memory is transparent, editable, auditable, but can suffer from noise/redundancy and relies heavily on retrieval quality as it scales (Section 3.1 "Discussion").
- `Parametric memory` (Section 3.2; Table 2)
  - Memory is stored in model parameters, accessed implicitly during forward computation.
  - Two types are distinguished (Section 3.2):
    - `Internal parametric memory`: memory injected into original weights; categorized further by training phase (pre-train / mid-train / post-train) in Table 2.
    - `External parametric memory`: memory stored in additional parameter sets (e.g., adapters/LoRA modules or auxiliary models) without changing base weights (Section 3.2.2; Table 2).
  - The survey highlights typical trade-offs: lower inference overhead but harder updates, and risks like interference/catastrophic forgetting (Section 3.2 "Discussion").
- `Latent memory` (Section 3.3; Figure 4; Table 3)
  - Defined as memory held in internal representations (KV cache, activations, hidden states, latent embeddings) rather than explicit tokens or parameters.
  - Organized by the "origin" of latent state (Figure 4; Section 3.3):
    - `Generate`: auxiliary module produces latent memory units reused later (Section 3.3.1; Table 3).
    - `Reuse`: carry over internal computational states such as KV caches (Section 3.3.2; Table 3).
    - `Transform`: compress/reshape internal state via selection/merging/projection (Section 3.3.3; Table 3).
  - The survey frames latent memory as token-efficient and suitable for multimodal fusion, but less interpretable/debuggable (Section 3.3 discussions; Figure 5).
- Adaptation guidance across forms (Section 3.4; Figure 5)
  - Figure 5 explicitly maps each form to "features" and "suitable applications":
    - Token-level: symbolic/transparent; good for chatbots, personalization, recommender systems, high-stakes domains.
    - Parametric: implicit/generalizable; good for role-playing and tasks requiring fundamentally new capabilities.
    - Latent: machine-native/token-efficient; good for multimodal memory and on-device/edge or low-resource settings.
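To make the flat / planar / hierarchical distinction for token-level memory tangible, here is a small sketch that stores the same three facts in each topology. The data, the dictionary layouts, and the helpers `neighbors` and `drill_down` are illustrative assumptions; the survey describes these topologies conceptually and does not prescribe data structures.

```python
# Flat (1D): a bag of memory units with no links between them.
flat = [
    "user prefers dark mode",
    "user is based in Berlin",
    "project deadline is Friday",
]

# Planar (2D): one layer of explicit relations, here a small adjacency map.
planar = {
    "user": [("prefers", "dark mode"), ("based_in", "Berlin")],
    "project": [("deadline", "Friday")],
}

# Hierarchical (3D): summaries above, details below, for coarse-to-fine lookup.
hierarchical = {
    "summary": "user profile and project status",
    "children": [
        {"summary": "user profile",
         "children": ["user prefers dark mode", "user is based in Berlin"]},
        {"summary": "project status",
         "children": ["project deadline is Friday"]},
    ],
}


def neighbors(graph: dict, entity: str) -> list:
    """Planar lookup: follow the explicit edges leaving one entity."""
    return [obj for _, obj in graph.get(entity, [])]


def drill_down(node: dict, keyword: str) -> list:
    """Hierarchical lookup: descend only into branches whose summary mentions the keyword."""
    hits = []
    for child in node["children"]:
        if isinstance(child, str):
            hits.append(child)  # leaf: a stored memory unit
        elif keyword in child["summary"]:
            hits.extend(drill_down(child, keyword))
    return hits
```

With the flat list, every lookup has to scan or embed all units; `neighbors(planar, "user")` follows edges instead, and `drill_down(hierarchical, "project")` prunes whole branches by reading summaries first.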
3.4.5 Functions: why do agents need memory? (Section 4; Figure 6)
The second major taxonomy is functional, moving beyond time-based labels:
- The survey defines three primary functional pillars (Figure 6; Section 4):
  - `Factual memory`: declarative facts about user/environment; answers "What does the agent know?" (Section 4.1).
  - `Experiential memory`: procedural/strategic knowledge distilled from success/failure and trajectories; answers "How does the agent improve?" (Section 4.2; Figure 7).
  - `Working memory`: capacity-limited workspace for within-episode context management; answers "What is the agent thinking about now?" (Section 4.3).
Key sub-structures the paper introduces:
- Factual memory splits by target entity (Table 4):
  - User factual memory (dialogue coherence, goal consistency; Section 4.1.1).
  - Environment factual memory (knowledge persistence, shared access for multi-agent collaboration; Section 4.1.2).
- Experiential memory splits by abstraction level (Figure 7; Table 5; Section 4.2):
  - Case-based: raw trajectories/solutions as exemplars (Section 4.2.1).
  - Strategy-based: distilled insights/workflows/patterns (Section 4.2.2).
  - Skill-based: executable capabilities (code snippets, functions/scripts, APIs, MCPs) (Section 4.2.3).
  - Plus Hybrid mixtures (Section 4.2.4).
- Working memory splits by interaction dynamics (Section 4.3; Table 6):
  - Single-turn: input condensation + observation abstraction (Section 4.3.1).
  - Multi-turn: state consolidation + hierarchical folding + cognitive planning (Section 4.3.2).
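The three abstraction levels of experiential memory differ in what is actually stored. The sketch below is one hypothetical way to represent them; the class names `Case`, `Strategy`, and `Skill`, the `slugify` helper, and the example contents are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """Case-based: a raw trajectory kept as an exemplar."""
    task: str
    steps: list


@dataclass
class Strategy:
    """Strategy-based: a distilled insight in natural language."""
    insight: str


@dataclass
class Skill:
    """Skill-based: an executable capability the agent can call directly."""
    name: str
    run: Callable[[str], str]


def slugify(title: str) -> str:
    return "-".join(title.lower().split())


experiential_memory = [
    Case("rename a file", ["list directory", "check name clash", "move file"]),
    Strategy("check for name clashes before moving files"),
    Skill("slugify", slugify),
]


def describe(item) -> str:
    """Show how each abstraction level might be surfaced to the agent."""
    if isinstance(item, Case):
        return f"example for '{item.task}': " + " -> ".join(item.steps)
    if isinstance(item, Strategy):
        return f"guideline: {item.insight}"
    return f"callable skill: {item.name}"
```

A case is replayed or imitated, a strategy is read as advice, and a skill is invoked (for example `experiential_memory[2].run("Hello World")`), which is why the three levels call for different storage and evaluation.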
3.4.6 Dynamics: how memory operates and evolves over time (Section 5; Figures 8–10; Table 7)
The third axis models memory as a lifecycle:
- Figure 8 frames the operational loop as:
  - Memory Formation (Section 5.1): extract compact knowledge from raw data.
  - Memory Evolution (Section 5.2): integrate new memories into the memory base (consolidate/update/forget).
  - Memory Retrieval (Section 5.3): decide when/what/how to retrieve and how to post-process results.
- The survey then classifies techniques within each stage:
  - Formation has five categories (Section 5.1; Table 7):
    - `Semantic summarization` (incremental vs partitioned; Section 5.1.1).
    - `Knowledge distillation` (factual vs experiential; Section 5.1.2).
    - `Structured construction` (entity-level vs chunk-level; Section 5.1.3).
    - `Latent representation` (textual vs multimodal latent; Section 5.1.4).
    - `Parametric internalization` (knowledge vs capability internalization; Section 5.1.5).
  - Evolution has three mechanisms (Section 5.2; Figure 9):
    - `Consolidation` (local / cluster-level / global integration; Section 5.2.1).
    - `Updating` (external memory update vs model editing; Section 5.2.2).
    - `Forgetting` (time-based / frequency-based / importance-driven; Section 5.2.3).
  - Retrieval is decomposed into four steps (Section 5.3; Figure 10):
    - timing & intent,
    - query construction (decomposition/rewriting),
    - retrieval strategies (lexical/semantic/graph/generative/hybrid),
    - post-retrieval processing (re-ranking/filtering; aggregation/compression).
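The four retrieval steps can be lined up as a tiny pipeline. This is a minimal sketch with deliberately crude stand-ins: the question-mark trigger, splitting on " and ", word-overlap scoring, and all function names are assumptions, and only the lexical strategy is shown.

```python
def should_retrieve(observation: str) -> bool:
    """Step 1 (timing & intent): a crude trigger that fires only on questions."""
    return observation.strip().endswith("?")


def build_queries(observation: str) -> list:
    """Step 2 (query construction): split a compound question into sub-queries."""
    parts = observation.rstrip("?").split(" and ")
    return [p.strip().lower() for p in parts if p.strip()]


def lexical_search(memory: list, query: str) -> list:
    """Step 3 (retrieval strategy): score each item by words shared with the query."""
    q = set(query.split())
    scored = [(len(q & set(m.lower().split())), m) for m in memory]
    return [(score, m) for score, m in scored if score > 0]


def post_process(hits: list, k: int = 2) -> list:
    """Step 4 (post-retrieval): de-duplicate, re-rank by best score, keep the top k."""
    best = {}
    for score, item in hits:
        best[item] = max(score, best.get(item, 0))
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def retrieve(memory: list, observation: str, k: int = 2) -> list:
    if not should_retrieve(observation):
        return []  # retrieval was never triggered
    hits = []
    for query in build_queries(observation):
        hits.extend(lexical_search(memory, query))
    return post_process(hits, k)
```

Because the trigger in step 1 is a heuristic, a statement that would have benefited from memory returns nothing at all, which is a small-scale version of the silent-failure risk the survey associates with autonomous retrieval timing.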
3.4.7 Core configurations and hyperparameters
- This paper is a survey and does not present a single trained model or unified training run, so global training hyperparameters (optimizer, learning rate, batch size, total tokens, hardware) are not applicable as a paper-level configuration.
- Where optimization details are part of the taxonomy, the survey explicitly names representative optimization approaches at the method level (e.g., RL methods like `PPO` and `GRPO` appear in the discussion of RL-optimized summarization in Section 5.1.1 and in Table 7, and multiple tables include an "Optimization" column such as Tables 2, 4, 5, 6).
4. Key Insights and Innovations
- A unified "Forms–Functions–Dynamics" taxonomy for agent memory (Figure 1; Sections 3–5).
  - Novelty: Rather than labeling memory by "short vs long term," it triangulates memory by representation, purpose, and lifecycle behavior.
  - Significance: This framing directly addresses fragmentation by letting two systems that both call themselves "memory" be compared along explicit axes (e.g., token-level working memory vs latent working memory; factual vs experiential).
- Clear scoping distinctions among `agent memory`, `LLM memory`, `RAG`, and `context engineering` (Section 2.3; Figure 2).
  - Novelty: The survey does not treat these as synonymous; it explains overlap in tooling (e.g., retrieval stacks) while separating them by whether the system maintains a persistent, self-evolving cognitive state (agent memory) vs. internal architecture optimizations (LLM memory) vs. static external grounding (classical RAG) vs. context window resource management (context engineering).
  - Significance: This is practically important because evaluation claims and design choices differ depending on which category you're actually in.
- A representational topology taxonomy within token-level memory: flat (1D) → planar (2D) → hierarchical (3D) (Section 3.1; Figure 3; Table 1).
  - Novelty: Memory is categorized by structural complexity and how that affects retrieval and reasoning, not just by "episodic/semantic" labels.
  - Significance: It helps explain why certain agents shift from simple vector stores to graphs/trees/hierarchies as tasks demand multi-hop reasoning and abstraction.
- A functional breakdown into `factual`, `experiential`, and `working` memory (Section 4; Figure 6), with experiential memory further split by abstraction level (Figure 7; Table 5).
  - Novelty: It explicitly distinguishes "knowing facts" from "learning how to do things" from "managing in-episode workspace," and connects these to different design patterns (e.g., skill memory as APIs/MCPs vs trajectories).
  - Significance: This clarifies what to store and how to evaluate: a benchmark testing preference recall should not be treated as measuring "experiential learning."
- A lifecycle / operational view (`F/E/R`) plus detailed decompositions of formation/evolution/retrieval (Section 2.2; Section 5; Figures 8–10; Table 7).
  - Novelty: The survey turns memory into an explicit process model with operators, not just a storage component.
  - Significance: It gives a common language for implementing memory managers and comparing systems that differ mainly in when/how they write, consolidate, forget, and retrieve.
5. Experimental Analysis
Because this is a survey, there is no single experimental setup whose results validate a new model; instead, the empirical contribution is a structured compilation of benchmarks and frameworks, plus taxonomies and comparisons.
- Evaluation methodology covered by the survey
  - The survey compiles and categorizes benchmarks for memory/lifelong/self-evolving agents and also "other" benchmarks that stress memory implicitly (Section 6.1; Table 8).
  - It also compiles open-source memory frameworks and compares their supported memory types and structures (Section 6.2; Table 9).
- Main quantitative details explicitly provided
  - Table 8 reports benchmark "Scale" values (samples/tasks) for many datasets. Examples visible in the provided excerpt include:
    - `MemBench`: 53,000 samples (Table 8).
    - `LoCoMo`: 300 samples (Table 8).
    - `MT-Mind2Web`: 720 samples (Table 8).
    - `LongBench`: 21 tasks / 4,750 samples (Table 8).
    - `MM-Needle`: approximately 280,000 samples (Table 8).
    - `HotpotQA`: 113k samples (Table 8).
  - These are presented as catalog metadata, not as performance results.
- Baselines and comparisons
  - The paper includes comparison tables for methods (e.g., Table 1 token-level methods; Table 2 parametric memory methods; Tables 4–6 functional method taxonomies).
  - These comparisons are categorical (carrier, structure, task domain, optimization type), not score-based.
- Do experiments support claims?
  - The survey's claims are primarily organizational and definitional, and the provided content supports those through:
    - formal definitions (Section 2),
    - taxonomies with diagrams (Figures 1–11),
    - method listings and structured tables (Tables 1–7, 8–9).
  - There is no single "claim of improved accuracy" that would require ablations; instead, the "support" is the breadth and structured mapping of the literature.
- Ablations / robustness checks / failure cases
  - Not applicable at the paper level: the survey does not introduce a new algorithm evaluated via ablations.
  - It does, however, discuss conceptual failure modes (e.g., retrieval noise, silent failure when retrieval is not triggered, stability-plasticity dilemma in updates, long-tail loss in forgetting) in Sections 5.2–5.3.
6. Limitations and Trade-offs
- Limited by survey scope and the moving target of the literature
  - The paper itself flags the rapid expansion of the field and invites readers to report missing papers (Introduction note), implying inevitable incompleteness.
- Taxonomy boundaries are not perfectly separable
  - The paper explicitly notes blurred boundaries, especially between agent memory and RAG as retrieval systems become more dynamic (Section 2.3.2).
  - Similarly, latent memory can overlap conceptually with parametric approaches when auxiliary modules generate latent states (Section 3.3.1 discussion).
- No unified evaluation protocol
  - While the survey compiles many benchmarks (Section 6.1; Table 8), it also emphasizes that evaluation protocols are fragmented across the literature (Introduction / motivation). The survey organizes benchmarks but does not (in the provided content) define a single standardized metric suite that resolves this fragmentation.
- Operational trade-offs highlighted (by lifecycle stage)
  - Formation: summarization/compression can lose detail (Section 5.1.1 summary).
  - Evolution:
    - consolidation risks smoothing out exceptions (Section 5.2.1 summary),
    - updating faces the stability-plasticity dilemma (Section 5.2.2 summary),
    - forgetting can drop long-tail but important facts (Section 5.2.3 summary).
  - Retrieval: autonomous retrieval timing can cause silent failures if the agent fails to retrieve when needed (Section 5.3.1 summary).
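The long-tail risk in forgetting is easy to reproduce with a toy retention score. The formula below is an assumption made for illustration (a half-life decay plus a log-frequency term, scaled by importance); it is not a formula from the survey, and the threshold and item values are arbitrary.

```python
import math


def retention_score(age_steps: int, access_count: int, importance: float,
                    half_life: float = 50.0) -> float:
    """Combine time decay, access frequency, and importance into one score."""
    recency = 0.5 ** (age_steps / half_life)   # time-based signal
    frequency = math.log1p(access_count)       # frequency-based signal
    return importance * (recency + frequency)  # importance-driven scaling


def forget(items: list, threshold: float) -> list:
    """Keep only items whose retention score meets the threshold."""
    return [
        item for item in items
        if retention_score(item["age"], item["hits"], item["importance"]) >= threshold
    ]


items = [
    {"text": "allergy: penicillin", "age": 400, "hits": 0, "importance": 1.0},
    {"text": "asked about weather", "age": 5, "hits": 3, "importance": 0.2},
]
kept = [item["text"] for item in forget(items, threshold=0.1)]
```

Here the old, never-accessed but highly important allergy fact falls below the threshold while the recent, frequently accessed trivia survives. Any policy that mixes these three signals needs some guard for rare but critical items, such as a minimum score for high-importance entries.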
7. Implications and Future Directions
- How this changes the landscape
  - The survey reframes memory as a first-class primitive in agent design (Abstract/Conclusion) and provides a shared language (`forms–functions–dynamics`) that can make future work easier to compare and compose.
- Emerging frontiers explicitly highlighted (Section 7)
  - Memory retrieval vs memory generation (Section 7.1):
    - The paper anticipates movement from "retrieve and paste" to generating memory representations that are context-adaptive and optimized for future utility (Section 7.1.2).
  - Automated memory management (Section 7.2):
    - A shift from hand-crafted policies to agents that reason about memory operations (add/update/delete/retrieve) as part of their decision loop.
  - Reinforcement learning integration (Section 7.3; Figure 11):
    - The survey presents a progression from RL-free → partially RL-involved → fully RL-driven memory control, aiming at end-to-end learned memory architectures and policies.
  - Multimodal memory (Section 7.4):
    - The survey argues real-world agent settings require multimodal memory, but notes current systems are still modality-specialized and not truly "omnimodal."
  - Shared memory for multi-agent systems (Section 7.5):
    - Shared memory moves from passive repositories to role-/trust-aware and learning-driven collective representations.
  - Memory for world models (Section 7.6):
    - Memory is described as central for interactive, long-horizon world simulation, and the survey sketches architectural directions like dual-system designs and active memory policies.
  - Trustworthy memory (Section 7.7):
    - The survey highlights privacy, explainability, and hallucination robustness as critical pillars for deployable memory systems, especially because memory can be persistent and user-specific.
- Repro/Integration Guidance (when to prefer what, based on the survey's framing)
  - Prefer token-level memory when you need inspectability, auditing, explicit add/delete/update, and stable long-horizon identity (Figure 5; Section 3.4).
  - Prefer parametric memory when the goal is internalized, generalizable behavior/knowledge without external retrieval overhead, accepting harder updates and interference risks (Section 3.2; Figure 5).
  - Prefer latent memory when token budget is tight or multimodal fusion is central, accepting reduced interpretability (Section 3.3; Figure 5).
  - Use the `functions` taxonomy (Figure 6) to align design with need:
    - personalization and coherence → factual memory,
    - self-improvement from experience → experiential memory,
    - long-horizon in-episode state management → working memory.
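The guidance above can be written down as a small lookup, mainly as a checklist. The mapping from needs to functions mirrors the list directly; the `suggest_form` helper, its parameter names, the priority order of its checks, and the token-level default are assumptions of this sketch rather than rules stated by the survey.

```python
# Need -> memory function, copied from the guidance list above.
FUNCTION_FOR_NEED = {
    "personalization and coherence": "factual memory",
    "self-improvement from experience": "experiential memory",
    "long-horizon in-episode state management": "working memory",
}


def suggest_form(needs_audit: bool, tight_token_budget: bool,
                 multimodal: bool, wants_internalized_behavior: bool) -> str:
    """Pick a memory form; the order of checks is an assumed priority, not the survey's."""
    if needs_audit:
        return "token-level"   # inspectable, editable, auditable
    if tight_token_budget or multimodal:
        return "latent"        # token-efficient, suited to multimodal fusion
    if wants_internalized_behavior:
        return "parametric"    # no retrieval overhead, harder to update
    return "token-level"       # assumed default when nothing else applies
```

In practice these requirements often conflict (for example, auditing plus a tight token budget), so a real system may combine forms rather than pick one.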