
Memory in the Age of AI Agents: A Survey

Forms, Functions and Dynamics

ArXiv: 2512.13564

🎯 Pitch

This survey introduces a unified taxonomy for agent memory across three axes—forms (token-level, parametric, latent), functions (factual, experiential, working), and dynamics (formation, evolution, retrieval)—clarifying inconsistent terminology and linking architectures to evaluation practices. By reframing memory as a first‑class, operational substrate for long‑horizon reasoning, continual adaptation, and multi‑agent/multimodal interaction, it provides researchers and practitioners a coherent foundation to design, benchmark, and advance memory‑enabled AI agents with real‑world longevity and reliability.


1. Executive Summary

This survey builds a unified, up-to-date taxonomy for “memory” in foundation-model-based AI agents by organizing the space along three axes—forms, functions, and dynamics—to resolve fragmented terminology and mismatched evaluation practices. It matters because memory is positioned as the enabling substrate for long-horizon reasoning, continual adaptation, and coherent interaction across sessions, and existing “short/long-term memory” labels are argued to be too coarse for modern agent systems.

2. Context and Motivation

  • What specific problem/gap is addressed?
  • The survey targets conceptual fragmentation in “agent memory” research: works differ in motivation, implementation, assumptions, and evaluation protocols, while memory terms (e.g., episodic/semantic/parametric) are used inconsistently.
  • It argues traditional taxonomies (especially “short-term vs long-term memory”) do not capture the diversity and dynamics of memory mechanisms now used in LLM-based agents.

  • Why is this problem important?

  • Memory is framed as a core capability for agents that must act over time, interact with environments, and adapt (Introduction).
  • The survey highlights application areas that depend on persistent memory-like behavior (e.g., personalized chatbots, recommender systems, social simulations, financial investigations—Introduction), and also positions memory as foundational for continual evolution through environment interaction.

  • What prior approaches existed, and where do they fall short?

  • Prior surveys exist, but the paper claims their taxonomies were developed before rapid advances (notably in 2025), leaving emerging directions underrepresented (Section “Agent Memory Needs A New Taxonomy”).
  • A key shortcoming is over-reliance on temporal labels (short/long-term) rather than distinguishing what is stored, what it is used for, and how it evolves.

  • How does this survey position itself relative to existing work?

  • It explicitly proposes a three-lens framework:
    • Forms = where/how memory is represented (Section 3; Figure 1).
    • Functions = why memory is needed (Section 4; Figure 6).
    • Dynamics = how memory is formed/evolved/retrieved (Section 5; Figure 8).
  • It also delineates scope by distinguishing agent memory from adjacent concepts like LLM memory, RAG, and context engineering (Section 2.3; Figure 2).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The “system” here is a survey framework: a structured way to classify and reason about memory mechanisms used in LLM-based agents.
  • It structures the problem by giving (i) formal definitions, (ii) a taxonomy across representation and purpose, and (iii) a lifecycle model (formation → evolution → retrieval) for how memory operates over time.

3.2 Big-picture architecture (diagram in words)

  • Box 1: Formal agent + environment loop (Section 2.1) → defines state s_t, observations o_t^i, actions a_t, and trajectories τ.
  • Box 2: Memory state M_t (Section 2.2) → a generic container (text buffer / KV store / vector DB / graph / hybrid).
  • Box 3: Memory operators (Section 2.2):
    • F: formation (extract/store useful artifacts),
    • E: evolution (consolidate/update/forget/restructure),
    • R: retrieval (produce memory signal m_t^i for the agent).
  • Box 4: Taxonomies layered on top (Sections 3–5):
    • Forms (token-level / parametric / latent),
    • Functions (factual / experiential / working),
    • Dynamics (formation / evolution / retrieval pipelines).

3.3 Roadmap for the deep dive

  • I first explain the paper’s formal definitions for agents and memory (π, M_t, F/E/R) because later taxonomies reuse these concepts (Section 2).
  • Next I describe scope boundaries vs. LLM memory, RAG, and context engineering to avoid conflating related work (Section 2.3; Figure 2).
  • Then I unpack the Forms taxonomy (Section 3; Figures 3–5), since it answers “what carries memory?”
  • After that, I cover Functions (Section 4; Figures 6–7), which answers “why does an agent need memory?”
  • Finally, I walk through Dynamics (Section 5; Figures 8–10; Table 7) as the operational lifecycle tying everything together.

3.4 Detailed, sentence-based technical breakdown

This is a survey/taxonomy paper with a formalization component; its core idea is that “memory for agents” should be understood through (i) representation forms, (ii) behavioral functions, and (iii) lifecycle dynamics, rather than only short/long-term time horizons (Figure 1; Sections 3–5).

3.4.1 Formalizing LLM-based agent systems (Section 2.1)

  • The paper models an agent system (single- or multi-agent) as interaction with an environment state space S, where the environment transitions via a controlled stochastic dynamics model:
  • s_{t+1} ~ Ψ(s_{t+1} | s_t, a_t).
  • Each agent i receives an observation:
  • o_t^i = O_i(s_t, h_t^i, Q),
  • where h_t^i is the visible history portion and Q is a task specification (instruction/goal/constraints).
  • Actions are explicitly heterogeneous (not just text), including natural language, tool invocation, planning outputs, environment control, and communication actions (Section 2.1).
  • The policy is written as:
  • a_t = π_i(o_t^i, m_t^i, Q),
  • where m_t^i is a memory-derived signal defined next.
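
The loop defined above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the integer state, the noise model, the target of 10, and the names `transition`, `observe`, `policy`, and `run_episode` are all hypothetical stand-ins for Ψ, O_i, and π_i, not anything specified by the survey.

```python
import random

def transition(state: int, action: int, rng: random.Random) -> int:
    """Toy stand-in for s_{t+1} ~ Psi(. | s_t, a_t): apply the action plus noise."""
    return state + action + rng.choice([-1, 0, 1])

def observe(state: int, history: list[int], task: str) -> dict:
    """Toy stand-in for o_t^i = O_i(s_t, h_t^i, Q): expose state and recent history."""
    return {"state": state, "recent": history[-3:], "task": task}

def policy(obs: dict, memory_signal: str, task: str) -> int:
    """Toy stand-in for a_t = pi_i(o_t^i, m_t^i, Q): step toward a target of 10."""
    step = 1 if obs["state"] < 10 else -1
    # The memory signal modulates behavior; in this sketch it doubles the step.
    return 2 * step if memory_signal == "move fast" else step

def run_episode(steps: int, memory_signal: str, seed: int = 0) -> list[tuple[int, int]]:
    """Roll out the loop and return the trajectory as (state, action) pairs."""
    rng = random.Random(seed)
    state, history, task = 0, [], "reach 10"
    trajectory = []
    for _ in range(steps):
        obs = observe(state, history, task)
        action = policy(obs, memory_signal, task)
        trajectory.append((state, action))
        history.append(state)
        state = transition(state, action, rng)
    return trajectory

plain = run_episode(5, memory_signal="")
guided = run_episode(5, memory_signal="move fast")
```

The only point of the sketch is that m_t^i enters the policy as a separate argument next to the observation, so the same observation can yield a different action once memory is available.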

3.4.2 Formalizing “agent memory systems” via a unified memory state and operators (Section 2.2)

  • Memory is represented as an evolving state:
  • M_t ∈ M,
  • where M is the set of admissible memory configurations and no specific structure is assumed (it could be text, vectors, a graph, etc.).
  • Memory is integrated into the agent loop through three conceptual operators:
  • Formation: the agent produces artifacts φ_t (tool outputs, reasoning traces, plans, self-evals, environment feedback), and the formation operator selectively transforms them:
    • M_{t+1}^{form} = F(M_t, φ_t).
  • Evolution: formed candidates are integrated/cleaned/restructured:
    • M_{t+1} = E(M_{t+1}^{form}),
    • with examples including consolidating redundancy, resolving conflicts, discarding low-utility items, or restructuring for retrieval (Section 2.2).
  • Retrieval: at decision time, the agent retrieves a context-dependent signal:
    • m_t^i = R(M_t, o_t^i, Q),
    • and m_t^i is formatted for LLM consumption (snippets/summary/structured content).
  • A key modeling choice is that short-term vs long-term effects are not forced into separate modules: they can “emerge” from invocation patterns of F/E/R (e.g., retrieve only at t=0, or continuously), rather than architectural separation (Section 2.2).

Worked micro-example (illustrative, based on the paper’s operators)
  • Suppose an agent is doing a multi-step web task.
  • At step t, it calls a tool and receives output plus feedback; these are part of φ_t.
  • F(M_t, φ_t) might extract a concise “lesson” (e.g., a successful navigation pattern) rather than storing the entire raw log.
  • E(...) might merge that lesson into an existing set of similar lessons and remove duplicates.
  • Later, when asked a related question, R(M_t, o_t^i, Q) retrieves the most relevant lesson(s) as m_t^i to guide the next action.
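
The micro-example above might look roughly like the following in code. This is a minimal sketch, assuming memory is a list of short text lessons, duplicates are exact matches, and relevance is plain word overlap; the `Lesson` type, the artifact fields (`topic`, `lesson`, `success`), and the function names are hypothetical and chosen only to mirror F, E, and R.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lesson:
    topic: str
    text: str

def form(memory: list[Lesson], artifacts: list[dict]) -> list[Lesson]:
    """F: keep a concise lesson from successful artifacts instead of raw logs."""
    new = [Lesson(a["topic"], a["lesson"]) for a in artifacts if a.get("success")]
    return memory + new

def evolve(memory: list[Lesson]) -> list[Lesson]:
    """E: drop exact duplicates while preserving insertion order."""
    return list(dict.fromkeys(memory))

def retrieve(memory: list[Lesson], observation: str, task: str, k: int = 1) -> list[Lesson]:
    """R: rank lessons by word overlap with the observation and task."""
    query = set((observation + " " + task).lower().split())

    def overlap(lesson: Lesson) -> int:
        return len(query & set((lesson.topic + " " + lesson.text).lower().split()))

    ranked = sorted(memory, key=overlap, reverse=True)
    return [lesson for lesson in ranked[:k] if overlap(lesson) > 0]

# phi_t: two identical successful lessons and one failed attempt.
artifacts = [
    {"topic": "checkout", "lesson": "open the cart page before applying a coupon", "success": True},
    {"topic": "checkout", "lesson": "open the cart page before applying a coupon", "success": True},
    {"topic": "login", "lesson": "raw stack trace", "success": False},
]
memory = evolve(form([], artifacts))
hits = retrieve(memory, "coupon field is missing", "apply a coupon at checkout")
```

Real systems would typically replace the overlap score with semantic search and the exact-match deduplication with a consolidation step; the sketch only shows how the three operators compose.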

3.4.3 Scope boundaries: agent memory vs. neighboring concepts (Section 2.3; Figure 2)

The survey argues many confusions come from overlapping implementations but different goals:

  • Agent memory vs LLM memory (Section 2.3.1)
  • The paper says “agent memory” largely subsumes what used to be called “LLM memory” in 2023–2024 (e.g., early systems framed as giving LLMs memory but functionally enabling agent-like persistence across interactions).
  • It draws a boundary: work focused on internal model mechanisms (e.g., KV cache management, long-context architectures, attention sparsity) is categorized as LLM memory and often outside the scope of agent memory because it does not necessarily provide cross-task persistence or deliberate F/E/R operations.

  • Agent memory vs RAG (Section 2.3.2)

  • Both use retrieval stacks (vector indices, semantic search, graphs), but classical RAG is framed as augmenting an LLM with static external knowledge for single tasks, while agent memory is framed as a persistent, self-evolving internal memory base built from the agent’s own interactions and feedback.
  • The paper notes the boundary is increasingly blurred (e.g., dynamic retrieval, systems interpreted as memory), so it also distinguishes them pragmatically by evaluation domains: RAG often appears in multi-hop QA benchmarks (the survey lists examples), while agent memory is often tested in sustained interactive settings and agent benchmarks (Section 2.3.2).

  • Agent memory vs context engineering (Section 2.3.3)

  • Context engineering is treated as optimizing the context window as a resource, whereas agent memory is treated as maintaining a persistent cognitive state (what the agent knows/has experienced) across tasks.
  • The overlap is especially strong in working memory and long-horizon interaction, where both rely on compression, organization, and selection (Section 2.3.3).

3.4.4 Forms: what carries memory? (Section 3; Figures 3–5)

The survey’s first major taxonomy is representational:

  • Token-level memory (Section 3.1)
  • Defined as storing memory in explicit, discrete, externally accessible units (“tokens” broadly includes text and other discrete modality units).
  • The key sub-taxonomy is by topological organization (Figure 3):
    • Flat (1D): no explicit inter-unit structure; a list/bag of chunks/dialogue/experiences/summaries (Section 3.1.1; Table 1).
    • Planar (2D): explicit single-layer structure like trees/graphs/tables but without multi-layer hierarchy (Section 3.1.2; Table 1).
    • Hierarchical (3D): multi-layer structures enabling coarse-to-fine abstraction and cross-layer reasoning (Section 3.1.3; Table 1).
  • The survey repeatedly stresses the trade-off: token-level memory is transparent, editable, auditable, but can suffer from noise/redundancy and relies heavily on retrieval quality as it scales (Section 3.1 “Discussion”).
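
To make the three topologies concrete, here is a small sketch of how each might be held in memory. The example contents, the `Node` type, and the helper names `two_hop` and `coarse_to_fine` are hypothetical; the survey defines the categories, not these data structures.

```python
from dataclasses import dataclass, field

# Flat (1D): a plain collection of chunks with no links between them.
flat_memory: list[str] = ["user prefers dark mode", "user lives in Lisbon"]

# Planar (2D): a single-layer graph; edges relate units, but there is no hierarchy.
planar_memory: dict[str, set[str]] = {
    "user": {"Lisbon", "dark mode"},
    "Lisbon": {"Portugal"},
}

def two_hop(graph: dict[str, set[str]], start: str) -> set[str]:
    """Follow edges up to two steps from start, the kind of multi-hop lookup a graph enables."""
    first = graph.get(start, set())
    second = set().union(*(graph.get(n, set()) for n in first))
    return first | second

# Hierarchical (3D): layered abstraction; each node summarizes its children.
@dataclass
class Node:
    text: str
    topics: set[str]
    children: list["Node"] = field(default_factory=list)

def coarse_to_fine(node: Node, topic: str) -> list[str]:
    """Descend only into branches tagged with the topic and return leaf-level text."""
    if topic not in node.topics:
        return []
    if not node.children:
        return [node.text]
    return [leaf for child in node.children for leaf in coarse_to_fine(child, topic)]

tree = Node("user profile", {"travel", "ui"}, [
    Node("travel history", {"travel"}, [Node("booked a flight to Lisbon", {"travel"})]),
    Node("interface preferences", {"ui"}, [Node("prefers dark mode", {"ui"})]),
])
```

The flat list needs a retriever to find anything, the graph supports relational hops, and the tree lets a query prune whole branches before touching fine-grained detail.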

  • Parametric memory (Section 3.2; Table 2)

  • Memory is stored in model parameters, accessed implicitly during forward computation.
  • Two types are distinguished (Section 3.2):
    • Internal parametric memory: memory injected into original weights; categorized further by training phase (pre-train / mid-train / post-train) in Table 2.
    • External parametric memory: memory stored in additional parameter sets (e.g., adapters/LoRA modules or auxiliary models) without changing base weights (Section 3.2.2; Table 2).
  • The survey highlights typical trade-offs: lower inference overhead but harder updates, and risks like interference/catastrophic forgetting (Section 3.2 “Discussion”).

  • Latent memory (Section 3.3; Figure 4; Table 3)

  • Defined as memory held in internal representations (KV cache, activations, hidden states, latent embeddings) rather than explicit tokens or parameters.
  • Organized by the “origin” of latent state (Figure 4; Section 3.3):
    • Generate: auxiliary module produces latent memory units reused later (Section 3.3.1; Table 3).
    • Reuse: carry over internal computational states such as KV caches (Section 3.3.2; Table 3).
    • Transform: compress/reshape internal state via selection/merging/projection (Section 3.3.3; Table 3).
  • The survey frames latent memory as token-efficient and suitable for multimodal fusion, but less interpretable/debuggable (Section 3.3 discussions; Figure 5).

  • Adaptation guidance across forms (Section 3.4; Figure 5)

  • Figure 5 explicitly maps each form to “features” and “suitable applications”:
    • Token-level: symbolic/transparent; good for chatbots, personalization, recommender systems, high-stakes domains.
    • Parametric: implicit/generalizable; good for role-playing and tasks requiring fundamentally new capabilities.
    • Latent: machine-native/token-efficient; good for multimodal memory and on-device/edge or low-resource settings.

3.4.5 Functions: why do agents need memory? (Section 4; Figure 6)

The second major taxonomy is functional, moving beyond time-based labels:

  • The survey defines three primary functional pillars (Figure 6; Section 4):
  • Factual memory: declarative facts about user/environment; answers “What does the agent know?” (Section 4.1).
  • Experiential memory: procedural/strategic knowledge distilled from success/failure and trajectories; answers “How does the agent improve?” (Section 4.2; Figure 7).
  • Working memory: capacity-limited workspace for within-episode context management; answers “What is the agent thinking about now?” (Section 4.3).

Key sub-structures the paper introduces:

  • Factual memory splits by target entity (Table 4):
    • User factual memory (dialogue coherence, goal consistency; Section 4.1.1).
    • Environment factual memory (knowledge persistence, shared access for multi-agent collaboration; Section 4.1.2).
  • Experiential memory splits by abstraction level (Figure 7; Table 5; Section 4.2):
    • Case-based: raw trajectories/solutions as exemplars (Section 4.2.1).
    • Strategy-based: distilled insights/workflows/patterns (Section 4.2.2).
    • Skill-based: executable capabilities (code snippets, functions/scripts, APIs, MCPs) (Section 4.2.3).
    • Plus Hybrid mixtures (Section 4.2.4).
  • Working memory splits by interaction dynamics (Section 4.3; Table 6):
    • Single-turn: input condensation + observation abstraction (Section 4.3.1).
    • Multi-turn: state consolidation + hierarchical folding + cognitive planning (Section 4.3.2).
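
One way to use the form and function axes together is to tag each memory component with both and compare systems axis by axis. The sketch below is a hypothetical encoding: the enum members follow the survey's category names, but `MemoryProfile`, `shared_axes`, and the two example systems are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Form(Enum):
    TOKEN = "token-level"
    PARAMETRIC = "parametric"
    LATENT = "latent"

class Function(Enum):
    FACTUAL = "factual"
    EXPERIENTIAL = "experiential"
    WORKING = "working"

@dataclass(frozen=True)
class MemoryProfile:
    system: str
    form: Form
    function: Function

def shared_axes(a: MemoryProfile, b: MemoryProfile) -> list[str]:
    """List the taxonomy axes on which two memory components agree."""
    axes = []
    if a.form is b.form:
        axes.append("form")
    if a.function is b.function:
        axes.append("function")
    return axes

# Two hypothetical systems: both manage working memory, but with different carriers.
system_a = MemoryProfile("system_a", Form.TOKEN, Function.WORKING)
system_b = MemoryProfile("system_b", Form.LATENT, Function.WORKING)
common = shared_axes(system_a, system_b)
```

Here `common` contains only `"function"`, which signals that the two systems address the same need and differ in representation.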

3.4.6 Dynamics: how memory operates and evolves over time (Section 5; Figures 8–10; Table 7)

The third axis models memory as a lifecycle:

  • Figure 8 frames the operational loop as:
  • Memory Formation (Section 5.1): extract compact knowledge from raw data.
  • Memory Evolution (Section 5.2): integrate new memories into the memory base (consolidate/update/forget).
  • Memory Retrieval (Section 5.3): decide when/what/how to retrieve and how to post-process results.
  • The survey then classifies techniques within each stage:
  • Formation has five categories (Section 5.1; Table 7):
    • Semantic summarization (incremental vs partitioned; Section 5.1.1).
    • Knowledge distillation (factual vs experiential; Section 5.1.2).
    • Structured construction (entity-level vs chunk-level; Section 5.1.3).
    • Latent representation (textual vs multimodal latent; Section 5.1.4).
    • Parametric internalization (knowledge vs capability internalization; Section 5.1.5).
  • Evolution has three mechanisms (Section 5.2; Figure 9):
    • Consolidation (local / cluster-level / global integration; Section 5.2.1).
    • Updating (external memory update vs model editing; Section 5.2.2).
    • Forgetting (time-based / frequency-based / importance-driven; Section 5.2.3).
  • Retrieval is decomposed into four steps (Section 5.3; Figure 10):
    • timing & intent,
    • query construction (decomposition/rewriting),
    • retrieval strategies (lexical/semantic/graph/generative/hybrid),
    • post-retrieval processing (re-ranking/filtering; aggregation/compression).
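
The four retrieval steps can be chained as in the following sketch. It assumes a lexical strategy, a word-count trigger for timing, and decomposition on the word "and"; these choices and all function names are hypothetical simplifications, not the survey's prescription.

```python
def should_retrieve(query: str) -> bool:
    """Step 1, timing and intent: a crude trigger that skips very short inputs."""
    return len(query.split()) > 2

def build_queries(query: str) -> list[str]:
    """Step 2, query construction: naive decomposition on ' and '."""
    return [part.strip() for part in query.lower().split(" and ") if part.strip()]

def lexical_search(store: list[str], query: str) -> list[tuple[int, str]]:
    """Step 3, retrieval strategy: score each item by the number of shared words."""
    words = set(query.split())
    scored = [(len(words & set(item.lower().split())), item) for item in store]
    return [(score, item) for score, item in scored if score > 0]

def post_process(hits: list[tuple[int, str]], k: int) -> list[str]:
    """Step 4, post-retrieval: merge duplicates by best score, re-rank, keep top k."""
    best: dict[str, int] = {}
    for score, item in hits:
        best[item] = max(score, best.get(item, 0))
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

def retrieve_pipeline(store: list[str], query: str, k: int = 2) -> list[str]:
    """Run the four steps in order; return nothing if retrieval is not triggered."""
    if not should_retrieve(query):
        return []
    hits = [hit for sub in build_queries(query) for hit in lexical_search(store, sub)]
    return post_process(hits, k)

store = [
    "the user prefers window seats",
    "the user is allergic to peanuts",
    "the hotel booking failed twice",
]
results = retrieve_pipeline(store, "seats the user prefers and what the user is allergic to")
```

The early return in `retrieve_pipeline` also shows how a mis-tuned trigger can produce the silent failure mode discussed for autonomous retrieval timing: the agent simply receives no memory.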

3.4.7 Core configurations and hyperparameters

  • This paper is a survey and does not present a single trained model or unified training run, so global training hyperparameters (optimizer, learning rate, batch size, total tokens, hardware) are not applicable as a paper-level configuration.
  • Where optimization details are part of the taxonomy, the survey explicitly names representative optimization approaches at the method level (e.g., RL methods like PPO and GRPO appear in the discussion of RL-optimized summarization in Section 5.1.1 and in Table 7, and multiple tables include an “Optimization” column such as Tables 2, 4, 5, 6).

4. Key Insights and Innovations

  • A unified “Forms–Functions–Dynamics” taxonomy for agent memory (Figure 1; Sections 3–5).
  • Novelty: Rather than labeling memory by “short vs long term,” it triangulates memory by representation, purpose, and lifecycle behavior.
  • Significance: This framing directly addresses fragmentation by letting two systems that both call themselves “memory” be compared along explicit axes (e.g., token-level working memory vs latent working memory; factual vs experiential).

  • Clear scoping distinctions among agent memory, LLM memory, RAG, and context engineering (Section 2.3; Figure 2).

  • Novelty: The survey does not treat these as synonymous; it explains overlap in tooling (e.g., retrieval stacks) while separating them by whether the system maintains a persistent, self-evolving cognitive state (agent memory) vs. internal architecture optimizations (LLM memory) vs. static external grounding (classical RAG) vs. context window resource management (context engineering).
  • Significance: This is practically important because evaluation claims and design choices differ depending on which category you’re actually in.

  • A representational topology taxonomy within token-level memory: flat (1D) → planar (2D) → hierarchical (3D) (Section 3.1; Figure 3; Table 1).

  • Novelty: Memory is categorized by structural complexity and how that affects retrieval and reasoning, not just by “episodic/semantic” labels.
  • Significance: It helps explain why certain agents shift from simple vector stores to graphs/trees/hierarchies as tasks demand multi-hop reasoning and abstraction.

  • A functional breakdown into factual, experiential, and working memory (Section 4; Figure 6), with experiential memory further split by abstraction level (Figure 7; Table 5).

  • Novelty: It explicitly distinguishes “knowing facts” from “learning how to do things” from “managing in-episode workspace,” and connects these to different design patterns (e.g., skill memory as APIs/MCPs vs trajectories).
  • Significance: This clarifies what to store and how to evaluate: a benchmark testing preference recall should not be treated as measuring “experiential learning.”

  • A lifecycle / operational view (F/E/R) plus detailed decompositions of formation/evolution/retrieval (Section 2.2; Section 5; Figures 8–10; Table 7).

  • Novelty: The survey turns memory into an explicit process model with operators, not just a storage component.
  • Significance: It gives a common language for implementing memory managers and comparing systems that differ mainly in when/how they write, consolidate, forget, and retrieve.

5. Experimental Analysis

Because this is a survey, there is no single experimental setup whose results validate a new model; instead, the empirical contribution is a structured compilation of benchmarks and frameworks, plus taxonomies and comparisons.

  • Evaluation methodology covered by the survey
  • The survey compiles and categorizes benchmarks for memory/lifelong/self-evolving agents and also “other” benchmarks that stress memory implicitly (Section 6.1; Table 8).
  • It also compiles open-source memory frameworks and compares their supported memory types and structures (Section 6.2; Table 9).

  • Main quantitative details explicitly provided

  • Table 8 reports benchmark “Scale” values (samples/tasks) for many datasets. Examples visible in the provided excerpt include:
    • MemBench: 53,000 samples (Table 8).
    • LoCoMo: 300 samples (Table 8).
    • MT-Mind2Web: 720 samples (Table 8).
    • LongBench: 21 tasks / 4,750 samples (Table 8).
    • MM-Needle: approximately 280,000 samples (Table 8).
    • HotpotQA: 113k samples (Table 8).
  • These are presented as catalog metadata, not as performance results.

  • Baselines and comparisons

  • The paper includes comparison tables for methods (e.g., Table 1 token-level methods; Table 2 parametric memory methods; Tables 4–6 functional method taxonomies).
  • These comparisons are categorical (carrier, structure, task domain, optimization type), not score-based.

  • Do experiments support claims?

  • The survey’s claims are primarily organizational and definitional, and the provided content supports those through:
    • formal definitions (Section 2),
    • taxonomies with diagrams (Figures 1–11),
    • method listings and structured tables (Tables 1–9).
  • There is no single “claim of improved accuracy” that would require ablations; instead, the “support” is the breadth and structured mapping of the literature.

  • Ablations / robustness checks / failure cases

  • Not applicable at the paper level: the survey does not introduce a new algorithm evaluated via ablations.
  • It does, however, discuss conceptual failure modes (e.g., retrieval noise, silent failure when retrieval is not triggered, stability–plasticity dilemma in updates, long-tail loss in forgetting) in Sections 5.2–5.3.

6. Limitations and Trade-offs

  • Limited by survey scope and the moving target of the literature
  • The paper itself flags the rapid expansion of the field and invites readers to report missing papers (Introduction note), implying inevitable incompleteness.

  • Taxonomy boundaries are not perfectly separable

  • The paper explicitly notes blurred boundaries, especially between agent memory and RAG as retrieval systems become more dynamic (Section 2.3.2).
  • Similarly, latent memory can overlap conceptually with parametric approaches when auxiliary modules generate latent states (Section 3.3.1 discussion).

  • No unified evaluation protocol

  • While the survey compiles many benchmarks (Section 6.1; Table 8), it also emphasizes that evaluation protocols are fragmented across the literature (Introduction / motivation). The survey organizes benchmarks but does not (in the provided content) define a single standardized metric suite that resolves this fragmentation.

  • Operational trade-offs highlighted (by lifecycle stage)

  • Formation: summarization/compression can lose detail (Section 5.1.1 summary).
  • Evolution:
    • consolidation risks smoothing out exceptions (Section 5.2.1 summary),
    • updating faces the stability–plasticity dilemma (Section 5.2.2 summary),
    • forgetting can drop long-tail but important facts (Section 5.2.3 summary).
  • Retrieval: autonomous retrieval timing can cause silent failures if the agent fails to retrieve when needed (Section 5.3.1 summary).

7. Implications and Future Directions

  • How this changes the landscape
  • The survey reframes memory as a first-class primitive in agent design (Abstract/Conclusion) and provides a shared language (forms–functions–dynamics) that can make future work easier to compare and compose.

  • Emerging frontiers explicitly highlighted (Section 7)

  • Memory retrieval vs memory generation (Section 7.1):
    • The paper anticipates movement from “retrieve and paste” to generating memory representations that are context-adaptive and optimized for future utility (Section 7.1.2).
  • Automated memory management (Section 7.2):
    • A shift from hand-crafted policies to agents that reason about memory operations (add/update/delete/retrieve) as part of their decision loop.
  • Reinforcement learning integration (Section 7.3; Figure 11):
    • The survey presents a progression from RL-free → partially RL-involved → fully RL-driven memory control, aiming at end-to-end learned memory architectures and policies.
  • Multimodal memory (Section 7.4):
    • The survey argues real-world agent settings require multimodal memory, but notes current systems are still modality-specialized and not truly “omnimodal.”
  • Shared memory for multi-agent systems (Section 7.5):
    • Shared memory moves from passive repositories to role-/trust-aware and learning-driven collective representations.
  • Memory for world models (Section 7.6):
    • Memory is described as central for interactive, long-horizon world simulation, and the survey sketches architectural directions like dual-system designs and active memory policies.
  • Trustworthy memory (Section 7.7):
    • The survey highlights privacy, explainability, and hallucination robustness as critical pillars for deployable memory systems, especially because memory can be persistent and user-specific.

  • Repro/Integration Guidance (when to prefer what, based on the survey’s framing)

  • Prefer token-level memory when you need inspectability, auditing, explicit add/delete/update, and stable long-horizon identity (Figure 5; Section 3.4).
  • Prefer parametric memory when the goal is internalized, generalizable behavior/knowledge without external retrieval overhead, accepting harder updates and interference risks (Section 3.2; Figure 5).
  • Prefer latent memory when token budget is tight or multimodal fusion is central, accepting reduced interpretability (Section 3.3; Figure 5).
  • Use the functions taxonomy (Figure 6) to align design with need:
    • personalization and coherence → factual memory,
    • self-improvement from experience → experiential memory,
    • long-horizon in-episode state management → working memory.
