Deep Research: A Systematic Survey

ArXiv: 2512.02038

🎯 Pitch

This survey formalizes the Deep Research (DR) paradigm—an end-to-end, tool‑augmented research workflow for LLM agents—and presents a unified roadmap, component taxonomy (query planning, information acquisition, memory, answer generation), optimization methods, and evaluation protocols. By distinguishing DR from conventional RAG and cataloging practical techniques and open challenges, it provides the foundations needed to build reliable, long‑horizon, evidence‑grounded AI researchers capable of multi‑step exploration, verification, and coherent long‑form synthesis.

1. Executive Summary

This survey formalizes Deep Research (DR) as an end-to-end, tool-augmented research workflow for LLM agents that iteratively plan queries, acquire and filter evidence, manage memory, and generate long-form, source-grounded reports (Figure 1; Sections 2–4). Its primary significance is that it provides a unified roadmap, a component taxonomy, and an overview of optimization and evaluation methods for systems that go beyond single-shot prompting and conventional RAG when tasks require multi-step exploration, verification, and coherent synthesis (Table 1; Sections 2.3, 5, 6).

2. Context and Motivation

What specific problem or gap does this paper address?
The paper targets the lack of a comprehensive, systematic survey of DR systems despite rapid progress (Introduction, Section 1).
It argues that existing surveys mostly cover adjacent areas—RAG and web agents—without fully capturing DR’s flexible autonomous workflow and the way it is intended to produce coherent evidence-grounded reports (Section 1; Sections 2.3, 3).
Why is this problem important?
Many real-world tasks are open-ended and require:
- Critical thinking and long-horizon reasoning,
- Multi-source evidence (often from dynamic environments),
- Verifiable outputs with citations/attribution.
These requirements exceed what “single-shot prompting” or static model knowledge can reliably provide (Section 1; Section 2.3).
The survey positions DR as a paradigm to bridge this gap by embedding LLMs in an iterative research loop (Sections 1–2).
What prior approaches existed, and where do they fall short?
Conventional RAG is presented as a more fixed pipeline with a narrower action space, in which retrieval is typically a heuristic augmentation step (Sections 1, 2.3; Table 1).
The survey highlights three needs that conventional RAG leaves unmet and that DR is designed to address (Section 2.3):
1. Flexible interaction with the digital world (search engines, web APIs, code executors),
2. Long-horizon planning with autonomous workflows (closed-loop control, iterative refinement),
3. Reliable language interfaces for open-ended tasks (mitigating hallucination/inconsistency via verifiability mechanisms).
How does this paper position itself relative to existing work?
It positions itself as a systematic synthesis:
- A three-phase capability roadmap (Section 2.2; Table 1),
- A four-component decomposition and sub-taxonomies (Section 3; Figures 1–6),
- Optimization paradigms (Section 4),
- Evaluation resources and open challenges (Sections 5–6).
It also presents a taxonomy map of the whole survey (Figure 2), linking components to representative works and benchmarks.

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The “system” in this survey is a conceptual blueprint for LLM-based research agents that repeatedly plan, search, remember, and write.
It solves complex open-ended tasks by organizing them into a closed-loop workflow (planning → evidence acquisition/filtering → memory lifecycle → answer/report generation) plus methods to optimize and evaluate the resulting agent.

3.2 Big-picture architecture (diagram in words)

A DR system takes a user query and cycles through four modules (Figure 1; Section 3):

Query Planning → produces sub-queries/sub-tasks and tool-call decisions.
Information Acquisition → retrieves/browses/uses tools, decides when to retrieve, and filters noise.
Memory Management → consolidates intermediate findings, indexes them for recall, updates them when contradicted, and forgets stale info.
Answer Generation → integrates plan + evidence + memory into a coherent, citation-backed long-form output (and potentially multimodal presentations).

3.3 Roadmap for the deep dive

First, define what DR is and how it differs from RAG via a three-phase roadmap (Section 2.2; Table 1).
Next, detail the four core workflow components and their sub-taxonomies (Section 3; Figures 3–6).
Then, cover how systems are built/optimized in practice: prompting workflows, SFT, and agentic RL (Section 4; Equations (1)–(7), Table 3).
Finally, explain how DR systems are evaluated and what remains open (Sections 5–6; Tables 4–5).

3.4 Detailed, sentence-based technical breakdown

Framing: This is a systematization/survey paper that proposes a unifying roadmap and component taxonomy for deep research agents, and then organizes optimization methods and evaluation benchmarks around that decomposition (Sections 1–5; Figures 1–2).

3.4.1 Formalizing DR via a three-phase capability roadmap

The survey defines DR as an end-to-end research loop in which an LLM agent plans and iterates over evidence acquisition, memory, and synthesis to produce verifiable, source-grounded reports with minimal supervision (Section 2.1).

It then structures DR as a capability trajectory with three phases (Section 2.2; Table 1):

Phase I: Agentic Search.
The agent focuses on finding correct sources and extracting answers with limited synthesis.
Evaluation emphasizes localized factual accuracy and operational metrics like retrieval recall and latency (Section 2.2).
Phase II: Integrated Research.
The agent iterates sub-question planning, retrieval from heterogeneous formats (e.g., HTML/tables/charts), and writes coherent structured reports.
Evaluation shifts to long-form criteria: factuality, citation verification, coherence, coverage (Section 2.2).
Phase III: Full-stack AI Scientist.
The agent goes beyond aggregation: hypothesis generation, experimental validation/ablation, critique, and novelty.
Evaluation emphasizes novelty/insightfulness, argument coherence, reproducibility, and calibrated uncertainty (Section 2.2).

Table 1 contrasts these phases with “Standard RAG” along axes such as tool access, memory management, reflection, reasoning horizon, workflow flexibility, and output form.

3.4.2 The “system/data pipeline diagram in words” (explicit flow)

A typical DR run follows this flow (Section 3; Figure 1):

Input: The user provides a complex query.
Planning step: The Query Planning module decomposes the query into sub-queries/sub-tasks (Section 3.1; Figure 3).
Evidence step: The Information Acquisition module chooses retrieval tools, decides when to retrieve, and collects candidate documents/snippets (Section 3.2).
Filtering step: Retrieved results are filtered via ranking, compression, or structure-aware cleaning to reduce noise (Section 3.2.3; Figure 4).
Memory step: Intermediate findings (evidence, partial answers, tool outputs, dialogue) are consolidated into durable memory, indexed for recall, updated when contradicted, and pruned via forgetting (Section 3.3; Figure 5).
Synthesis step: The Answer Generation module integrates plan + evidence + memory, resolves conflicts, maintains long-form coherence, structures the narrative/reasoning, and optionally generates multimodal presentation artifacts (Section 3.4; Figure 6).
Iteration: The system loops: new gaps discovered during synthesis feed back into new planning and acquisition steps until a stop condition/budget is met (Sections 3–4).
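
The flow above can be sketched as a small control loop. This is a minimal illustration, not the survey's implementation: the names `ResearchState` and `deep_research_loop`, the five callables (`plan`, `acquire`, `keep`, `synthesize`, `find_gaps`), and the round budget are all hypothetical, and the memory step is reduced to plain consolidation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResearchState:
    """Working state carried across iterations (hypothetical structure)."""
    query: str
    memory: List[str] = field(default_factory=list)          # consolidated findings
    open_questions: List[str] = field(default_factory=list)  # gaps still to investigate


def deep_research_loop(
    query: str,
    plan: Callable[[ResearchState], List[str]],
    acquire: Callable[[str], List[str]],
    keep: Callable[[str, str], bool],
    synthesize: Callable[[ResearchState], str],
    find_gaps: Callable[[ResearchState, str], List[str]],
    max_rounds: int = 3,
) -> str:
    state = ResearchState(query=query, open_questions=[query])
    report = ""
    for _ in range(max_rounds):                       # budget-based stop condition
        for sub_query in plan(state):                 # planning step
            docs = acquire(sub_query)                 # evidence step
            kept = [d for d in docs if keep(sub_query, d)]  # filtering step
            state.memory.extend(kept)                 # memory step (consolidation only)
        report = synthesize(state)                    # synthesis step
        state.open_questions = find_gaps(state, report)
        if not state.open_questions:                  # no gaps left: stop early
            break
    return report


# Toy stand-ins for the modules (assumptions for illustration only).
corpus = {"a": ["fact about a", "noise"], "b": ["fact about b"]}
report = deep_research_loop(
    "a",
    plan=lambda s: list(s.open_questions),
    acquire=lambda q: corpus.get(q, []),
    keep=lambda q, d: d.startswith("fact"),
    synthesize=lambda s: "; ".join(s.memory),
    find_gaps=lambda s, r: [] if "fact about b" in r else ["b"],
)
print(report)
```

In the toy run, the first synthesis exposes a gap ("b"), which feeds a second planning and acquisition round before the loop stops.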

3.4.3 Component 1 — Query Planning (how the plan is produced)

The survey defines Query Planning as transforming a complex question into an executable sequence of sub-queries/sub-tasks to support stepwise knowledge acquisition and reasoning (Section 3.1).

It organizes planning into three strategies (Figure 3; Sections 3.1.1–3.1.3):

Parallel planning
The planner decomposes/rewrites the query “in a single pass,” producing multiple sub-questions that can be processed concurrently (Section 3.1.1; Figure 3(a)).
The survey’s key trade-off is efficiency vs. adaptability: one-shot decomposition cannot incorporate intermediate evidence and may ignore dependencies among sub-queries (Section 3.1.1).
Sequential planning
The planner decomposes iteratively, where each step depends on previous outputs and can trigger tools dynamically (Section 3.1.2; Figure 3(b)).
The survey highlights that this supports dependency-aware reasoning but increases latency/cost and can accumulate error across turns (Section 3.1.2).
Tree-based planning
The planner explores a branching search space represented as a tree or DAG, enabling pruning/backtracking/heuristic search (Section 3.1.3; Figure 3(c)).
The survey notes that this can balance effectiveness and efficiency but introduces training challenges such as credit assignment in RL and dependency modeling (Section 3.1.3).
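
The efficiency vs. adaptability trade-off between the first two strategies can be illustrated with a short sketch. The functions `parallel_plan` and `sequential_plan`, the `decompose`/`next_query`/`answer` callables, and the toy fact table are hypothetical stand-ins for an LLM planner and a retrieval tool; tree-based planning is not covered here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

History = List[Tuple[str, str]]


def parallel_plan(
    question: str,
    decompose: Callable[[str], List[str]],
    answer: Callable[[str], str],
) -> List[str]:
    """One-shot decomposition; sub-queries are answered concurrently."""
    sub_queries = decompose(question)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(answer, sub_queries))  # map preserves input order


def sequential_plan(
    question: str,
    next_query: Callable[[str, History], Optional[str]],
    answer: Callable[[str], str],
    max_steps: int = 5,
) -> History:
    """Each sub-query is chosen only after earlier answers are known."""
    history: History = []
    for _ in range(max_steps):
        sub_query = next_query(question, history)
        if sub_query is None:                        # planner decides it is done
            break
        history.append((sub_query, answer(sub_query)))
    return history


# Toy two-hop task: the second hop depends on the answer to the first.
facts: Dict[str, str] = {"who wrote X": "Y", "where was Y born": "Z"}
lookup = lambda q: facts.get(q, "unknown")


def toy_next_query(question: str, history: History) -> Optional[str]:
    if not history:
        return "who wrote X"
    if len(history) == 1:
        return f"where was {history[-1][1]} born"    # uses the intermediate answer
    return None


parallel_result = parallel_plan(
    "where was the author of X born",
    decompose=lambda q: ["who wrote X", "where was the author of X born"],
    answer=lookup,
)
sequential_result = sequential_plan(
    "where was the author of X born", toy_next_query, lookup
)
print(parallel_result)
print(sequential_result)
```

The parallel planner cannot rewrite its second sub-query with the intermediate answer, so the dependent hop stays unresolved, while the sequential planner resolves it at the cost of an extra round trip.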

3.4.4 Component 2 — Information Acquisition (retrieval tools, when to retrieve, and filtering)

The survey splits information acquisition into three subproblems (Section 3.2):

(A) Retrieval tools (what external access looks like)

The survey distinguishes:
Text retrieval (lexical vs. semantic/dense vs. commercial web search) (Section 3.2.1),
Multimodal retrieval for figures/tables/layout and grounded pointers for citations (Section 3.2.1).
A key conceptual point is that multimodal retrieval is motivated by information that is not well captured by plain text, and it introduces costs such as OCR sensitivity and cross-modal alignment complexity (Section 3.2.1).

(B) Retrieval timing / adaptive retrieval (when to retrieve)

Retrieval timing is defined as deciding when the model should invoke retrieval during iterative reasoning (Section 3.2.2).
The survey motivates adaptive retrieval by noting that retrieval is costly and that retrieved documents can be low-quality or misleading, so systems should retrieve when the model lacks sufficient knowledge (“knowledge boundary perception”) (Section 3.2.2).

It categorizes “confidence” and retrieval triggering into four families (Section 3.2.2):

Probabilistic confidence (token probabilities as confidence proxies).
Consistency-based confidence (semantic agreement across multiple generations/models/languages).
Internal state probing (use model internal representations as factuality/confidence signals).
Verbalized confidence (the model explicitly indicates uncertainty or emits a retrieval token like <retrieve>).

It also describes an evolution from fixed-per-step retrieval (e.g., always retrieve) toward dynamically triggered retrieval and finally RL-trained retrieval policies (Section 3.2.2).
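
Two of the confidence families above lend themselves to a compact sketch of a dynamic retrieval trigger. The function names, the 0.7 threshold, and the use of exact string agreement as a stand-in for semantic agreement are assumptions for illustration; the token log-probabilities would come from whatever model API is in use.

```python
import math
from collections import Counter
from typing import Sequence


def probabilistic_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a generated answer."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_confidence(samples: Sequence[str]) -> float:
    """Share of sampled answers agreeing with the most common one.

    Exact match after lowercasing/stripping is a crude proxy for the
    semantic agreement the survey describes.
    """
    if not samples:
        return 0.0
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


def should_retrieve(confidence: float, threshold: float = 0.7) -> bool:
    """Trigger retrieval when confidence falls below a (hypothetical) threshold."""
    return confidence < threshold


confident = probabilistic_confidence([math.log(0.9), math.log(0.9)])
uncertain = consistency_confidence(["Paris", "paris ", "Lyon"])
print(should_retrieve(confident), should_retrieve(uncertain))
```

A fixed threshold is the simplest policy; the RL-trained policies mentioned above would replace `should_retrieve` with a learned decision.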

(C) Information filtering (how retrieved content is refined)

The survey defines Information Filtering as selecting/refining retrieved results to reduce noise and prevent misleading contexts (Section 3.2.3). Figure 4 summarizes three classes:

Document selection (rank/select top-k):
Point-wise scoring, pair-wise comparisons, or list-wise global ranking (Section 3.2.3).
Context compression:
Lexical summarization/notes or embedding-based compression into dense tokens (Section 3.2.3).
Rule-based cleaning:
Structure-aware cleanup for formats like HTML or tables (Section 3.2.3).

The survey emphasizes a trade-off: filtering improves robustness but adds compute/latency and can remove useful evidence if too aggressive (Section 3.2.3).
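
As a rough illustration of two of these classes, the sketch below combines rule-based HTML cleaning with point-wise document selection. The lexical-overlap scorer is a deliberately simple assumption standing in for a learned reranker, and all function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def clean_html(raw: str) -> str:
    """Rule-based cleaning: drop markup and non-content elements."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def overlap_score(query: str, doc: str) -> float:
    """Point-wise score: fraction of query words that appear in the document."""
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0


def select_top_k(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Document selection: keep the k highest-scoring documents."""
    return sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)[:k]


cleaned = clean_html("<p>Dense <b>retrieval</b></p><script>x=1</script>")
selected = select_top_k(
    "dense retrieval methods",
    ["dense retrieval with embeddings", "cooking pasta", "retrieval methods survey"],
)
print(cleaned)
print(selected)
```

Lowering `k` makes the filter more aggressive, which is where the risk of discarding useful evidence comes from.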

3.4.5 Component 3 — Memory Management (how long-horizon context is sustained)

The survey presents memory as a key differentiator between DR and simpler retrieval-augmented systems: DR needs memory to maintain context across long investigations (Section 3.3, including its takeaway).

It decomposes memory into four lifecycle operations (Figure 5; Section 3.3):

Memory consolidation
Converts raw interaction history/tool outputs into durable representations.
The survey distinguishes unstructured consolidation (summaries/event logs) vs. structured consolidation (databases/graphs/trees) (Section 3.3.1).
Memory indexing
Builds efficient retrieval pathways over consolidated memories, using metadata/embeddings and retrieval structures (Section 3.3.2).
The survey lists paradigms: signal-enhanced indexing, graph-based indexing, and timeline-based indexing (Section 3.3.2).
Memory updating
Modifies memory in response to new evidence; the survey distinguishes:
- Non-parametric updating (explicit edits in an external store),
- Parametric updating (updating model weights; global/local/modular approaches) (Section 3.3.3).
Memory forgetting
Removes or suppresses irrelevant/outdated content.
The survey distinguishes passive (time/recency/FIFO decay) vs. active forgetting (explicit deletes/invalidation; or parametric unlearning) (Section 3.3.4).
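
The four lifecycle operations can be sketched with a small non-parametric store. This is an illustrative toy under stated assumptions: the class `ResearchMemory`, its word-level inverted index, and the FIFO capacity limit are hypothetical choices, and parametric updating or unlearning is out of scope.

```python
from typing import Dict, List, Set


class ResearchMemory:
    """Toy external memory: consolidate, index, update, forget."""

    def __init__(self, capacity: int = 3) -> None:
        self.capacity = capacity
        self.items: Dict[str, str] = {}        # dicts preserve insertion order
        self.index: Dict[str, Set[str]] = {}   # word -> keys of entries containing it

    def consolidate(self, key: str, raw_notes: List[str]) -> None:
        """Unstructured consolidation: merge raw notes into one durable entry."""
        self.update(key, " ".join(n.strip() for n in raw_notes))

    def update(self, key: str, text: str) -> None:
        """Non-parametric update: replace the entry in the external store."""
        if key in self.items:
            self.forget(key)                   # drop the stale version first
        self.items[key] = text
        for word in set(text.lower().split()):  # indexing step
            self.index.setdefault(word, set()).add(key)
        while len(self.items) > self.capacity:  # passive FIFO forgetting
            self.forget(next(iter(self.items)))

    def recall(self, word: str) -> List[str]:
        """Look up entries through the index."""
        return [self.items[k] for k in sorted(self.index.get(word.lower(), ()))]

    def forget(self, key: str) -> None:
        """Active forgetting: explicit delete, including index entries."""
        text = self.items.pop(key, None)
        if text is None:
            return
        for word in set(text.lower().split()):
            keys = self.index.get(word)
            if keys is not None:
                keys.discard(key)
                if not keys:
                    del self.index[word]


memory = ResearchMemory(capacity=2)
memory.consolidate("a", ["alpha fact", " more alpha "])
memory.update("b", "beta fact")
before = memory.recall("fact")
memory.update("a", "alpha corrected")   # contradicted entry is overwritten
after = memory.recall("fact")
memory.update("c", "gamma")             # exceeds capacity: oldest entry is evicted
print(before, after, list(memory.items))
```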

3.4.6 Component 4 — Answer Generation (how evidence becomes a report)

The survey treats answer generation as the culminating stage that must go beyond fluent text to: integrate upstream information, resolve conflicts, maintain long-form coherence, structure reasoning/narrative, and potentially output multimodal presentations (Section 3.4; Figure 6).

It organizes answer generation into four stages (Sections 3.4.1–3.4.4):

Integrating upstream information
Combine sub-queries, evidence (ranked/conflicting), and memory state into a context for generation.
The survey highlights “stateful” planning where the evolving memory informs future planning and synthesis (Section 3.4.1).
Synthesizing evidence & maintaining coherence
For conflicts, it lists three strategies (Section 3.4.2):
- Credibility-aware attention (weigh sources differently),
- Multi-agent deliberation (multiple agents analyze and a meta-step synthesizes),
- RL for factuality (reward evidence-supported statements).
For long-form coherence, it discusses the challenge of maintaining logical thread and information density over long outputs (Section 3.4.2).
Structuring reasoning & narrative
It describes prompting and planning methods such as Chain-of-Thought, explicit outline/plan-guided writing, and tool-augmented reasoning (Section 3.4.3).
Presentation generation
The survey broadens DR output from text to structured/multimodal deliverables like slides/posters/audio/video, noting that only some systems support this comprehensively (Section 3.4.4; Table 2).
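
To make the conflict-resolution idea from the synthesis stage concrete, the sketch below resolves disagreeing sources with a credibility-weighted vote and keeps the supporting sources for citation. It is a simplified stand-in for the credibility-aware attention the survey describes, not that mechanism itself; the function name, the credibility scores, and the default weight are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def resolve_conflict(
    claims: List[Tuple[str, str]],
    credibility: Dict[str, float],
    default: float = 0.5,
) -> Tuple[str, List[str]]:
    """Pick the answer with the highest total source credibility.

    claims: (source_id, answer) pairs extracted from retrieved evidence.
    Returns the winning answer and the sources that support it.
    """
    if not claims:
        raise ValueError("no claims to resolve")
    scores: Dict[str, float] = defaultdict(float)
    support: Dict[str, List[str]] = defaultdict(list)
    for source, answer in claims:
        scores[answer] += credibility.get(source, default)  # unknown sources get the default
        support[answer].append(source)
    best = max(scores, key=lambda a: scores[a])
    return best, support[best]


claims = [("blog", "2019"), ("paper", "2020"), ("forum", "2019")]
credibility = {"paper": 0.9, "blog": 0.3, "forum": 0.2}
answer, sources = resolve_conflict(claims, credibility)
cited = f"{answer} [{', '.join(sources)}]"
print(cited)
```

Here two low-credibility sources are outweighed by one high-credibility source, and the output carries its provenance so the final report can cite it.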

3.4.7 Practical optimization of DR systems (prompting, SFT, RL)

The survey groups optimization into three paradigms (Section 4; Figure 2):

Workflow prompt engineering
Builds DR as a multi-agent workflow where an orchestrator delegates subtasks to worker agents and enforces citation/logging/budgets (Section 4.1).
A representative example is a multi-agent architecture with explicit budgeting and evidence provenance logging (Section 4.1.1).
Supervised fine-tuning (SFT)
Used as a “cold start” for agent behaviors before RL (Section 4.2).
Two data-generation paradigms (Section 4.2; Figure 7):
- Strong-to-weak distillation (single-agent or multi-agent teacher generates trajectories),
- Iterative self-evolving (self-training loops that generate and filter data over iterations).
End-to-end agentic reinforcement learning
The paper includes a more mathematical description here (Section 4.3):
- PPO objective with clipping is given by Equations (1)–(5).
- GRPO is described as group-relative normalization, Equations (6)–(7).
- Table 3 defines symbols such as π_θ, π_θold, Â_t, r_t(θ), group size m, and reward R(o|·).

Plain-language paraphrase of PPO/GRPO (with symbol grounding):

In PPO, the policy π_θ (the LLM policy) is updated so that outputs with higher estimated advantage Â_t become more likely, while a clipping term prevents the policy from moving too far from π_θold, the snapshot of the policy that generated the sampled data (Equations (1)–(2); Table 3).
Â_t is computed from discounted rewards and a value function V_ϕ(s_t) (Equations (3)–(5)).
In GRPO, instead of relying on an explicit value model, the algorithm samples a group G of m candidate responses for the same query and computes a group-relative advantage by normalizing each reward against the group’s mean and standard deviation (Equation (6)), then applies a PPO-like clipped objective over the group (Equation (7)).
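
For reference, the paraphrase above corresponds to the following commonly used textbook forms. They are written here as an aid and are not copied from the survey, whose Equations (1)–(7) may differ in notation and detail. Assumptions: R_t denotes the per-step reward (to avoid a clash with the ratio r_t(θ)), the advantage is shown with generalized advantage estimation as one common choice, ε is the clipping range, ϵ is a small stabilizing constant, and ρ_i(θ) is the sequence-level ratio; any KL penalty and token-level averaging are omitted.

```latex
\begin{align*}
% PPO: probability ratio and clipped surrogate objective
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\
\mathcal{J}_{\text{PPO}}(\theta) &= \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\!\left(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_t\right)\right] \\
% One common advantage estimator (GAE) built from rewards and the value function
\delta_t &= R_t + \gamma\,V_\phi(s_{t+1}) - V_\phi(s_t) \\
\hat{A}_t &= \sum_{l \ge 0} (\gamma\lambda)^{l}\,\delta_{t+l} \\
% GRPO: group-relative advantage for response o_i among m samples for query q
\hat{A}_i &= \frac{R(o_i \mid q) - \operatorname{mean}\{R(o_j \mid q)\}_{j=1}^{m}}
                  {\operatorname{std}\{R(o_j \mid q)\}_{j=1}^{m} + \epsilon} \\
% GRPO: PPO-like clipped objective averaged over the group
\rho_i(\theta) &= \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} \\
\mathcal{J}_{\text{GRPO}}(\theta) &= \mathbb{E}\!\left[\frac{1}{m}\sum_{i=1}^{m}
  \min\!\left(\rho_i(\theta)\,\hat{A}_i,\;
  \operatorname{clip}\!\left(\rho_i(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)\right]
\end{align*}
```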

Worked micro-example (illustrative; not a paper result):

Suppose a query produces a response group of m = 3 candidates with scalar rewards R = {2, 5, 8}. Then mean(R) = 5 and std(R) = 3 when the sample standard deviation (n − 1 denominator) is used; the population standard deviation would be √6 ≈ 2.45. With ϵ ≈ 0, the group-relative advantages Â for the three responses are {(2−5)/3, (5−5)/3, (8−5)/3} = {−1, 0, +1}. This example illustrates the paper’s point that GRPO trains by relative ranking within a group rather than needing a learned value baseline (Section 4.3.1; Equations (6)–(7)).
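
The same arithmetic can be checked in a few lines of Python. This is a minimal sketch: the function name and the `sample_std` switch are hypothetical, and the choice between sample and population standard deviation is an implementation detail the example has to make explicit. With the sample standard deviation (n − 1 denominator), the rewards {2, 5, 8} have a spread of exactly 3; with the population standard deviation the spread is √6 ≈ 2.45.

```python
import statistics
from typing import List


def group_relative_advantages(
    rewards: List[float], eps: float = 1e-8, sample_std: bool = True
) -> List[float]:
    """Normalize each reward against the group's mean and standard deviation."""
    if len(rewards) < 2:
        raise ValueError("need at least two rewards per group")
    mean = statistics.fmean(rewards)
    # stdev uses the n-1 denominator; pstdev uses n.
    std = statistics.stdev(rewards) if sample_std else statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


sample_adv = group_relative_advantages([2, 5, 8])
population_adv = group_relative_advantages([2, 5, 8], sample_std=False)
print([round(a, 3) for a in sample_adv])
print([round(a, 3) for a in population_adv])
```

Either convention preserves the ranking within the group; only the scale of the advantages changes.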

Core configurations/hyperparameters note: This paper is a survey and does not present a single unified DR model training run with concrete hyperparameters such as batch size, layers, heads, tokens, or hardware. Where it does present algorithmic details, they take the form of general RL objectives and notation (Section 4.3; Table 3; Equations (1)–(7)), not a specific experimental configuration.

4. Key Insights and Innovations

A three-phase roadmap that distinguishes DR from RAG and frames capability progression
The survey’s Phase I/II/III breakdown clarifies that DR is not merely “RAG with better prompts,” but a trajectory from agentic search to integrated long-form synthesis to full-stack scientific workflows (Section 2.2; Table 1).
Significance: it provides a conceptual scaffold to classify systems and align evaluation criteria to capability level.
A four-component decomposition with fine-grained sub-taxonomies
The paper’s main structural contribution is organizing DR into query planning, information acquisition, memory management, and answer generation (Figure 1; Section 3), each with subcategories (e.g., planning types in Figure 3; filtering types in Figure 4; memory lifecycle in Figure 5; generation stages in Figure 6).
Significance: it turns a fast-moving space into a modular design space, enabling clearer comparisons and engineering decisions.
Optimization taxonomy spanning prompting, SFT, and agentic RL (module-level vs pipeline-level)
The survey distinguishes workflow-engineered DR systems from trained agents, then subdivides RL into optimizing a specific module vs the entire pipeline (Sections 4.1–4.3; Figure 2).
Significance: it emphasizes that “agent capability” can be achieved via different engineering/training regimes with different cost/robustness trade-offs.
Explicit attention to retrieval timing and knowledge boundary perception as a core DR problem
Rather than treating retrieval as always-on, the survey elevates adaptive retrieval timing and confidence/uncertainty estimation as a first-class module (Section 3.2.2; Section 6.1).
Significance: this targets a key failure mode for DR systems—over-retrieval, under-retrieval, and being misled by noisy documents.
Evaluation landscape consolidation across tasks, environments, and output formats
The paper collects benchmarks for agentic information seeking, interactive environments, report generation, and “AI for Research” tasks (Section 5; Tables 4–5).
Significance: it provides a starting point for standardized reporting, while also highlighting that long-form evaluation remains an open challenge (Section 6.4).

5. Experimental Analysis

Because this is a survey (not a single new model), the “experimental” content is primarily a structured overview of evaluation methodology—datasets, metrics, and environments—rather than reporting new experimental results.

5.1 Evaluation methodology (datasets, metrics, setups)

The survey organizes DR evaluation into multiple scenarios (Section 5):

Agentic information seeking (Section 5.1)
Benchmarks evolve from static QA to multi-hop reasoning and then interactive web environments (Sections 5.1.1–5.1.2).
Table 4 lists QA-focused benchmarks with dates, sizes, and metrics such as Exact Match, F1, and Accuracy (Table 4).
Interaction environments (Section 5.1.2)
Evaluations increasingly require web navigation and tool interaction rather than a fixed corpus (Section 5.1.2; Table 5 includes environment-style metrics like Success Rate, Step Success Rate, etc.).
Comprehensive report generation (Section 5.2)
Survey generation, long-form report generation, poster generation, slides generation are treated as distinct evaluation targets with different rubrics/metrics (Sections 5.2.1–5.2.4; Table 5).
AI for Research and software engineering (Sections 5.3–5.4)
Includes idea generation, experimental execution, academic writing, peer review, and software engineering, often relying on environment-based evaluation or judge-based rubrics (Table 5; Sections 5.3–5.4).
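
For the QA-style metrics named above, a minimal sketch of Exact Match and token-level F1 is shown below. It follows the widely used SQuAD-style normalization (lowercasing, removing punctuation and English articles), which is an assumption here: individual benchmarks in Table 4 may define normalization and scoring differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)  # both empty counts as a match
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("Eiffel Tower in Paris", "the Eiffel Tower")
print(em, round(f1, 3))
```

Exact Match rewards only fully correct strings, while F1 gives partial credit for overlapping tokens, which is why the two are often reported together.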

5.2 Main quantitative details provided (as cataloged numbers)

The paper includes benchmark sizes and metrics in tables; for example:

Table 4 (QA benchmarks) provides dataset sizes such as:
NQ: 307,373 / 7,830 / 7,842 (train/dev/test) with Exact Match / F1 / Accuracy.
HotpotQA: 90,124 / 5,617 / 5,813 with Exact Match / F1 / Accuracy.
Others include GPQA (448, Accuracy), GAIA (450, Exact Match), BrowseComp (1266, Exact Match), HLE (2500, Exact Match / Accuracy) (Table 4).
Table 5 (broader scenarios) enumerates many benchmarks and their evaluation styles, including:
DeepResearchGym, with a listed size of 96,000 and metrics such as report relevance / retrieval faithfulness / report quality (Table 5).
AutoSurvey, with a listed size of 530,000 and metrics including citation and content quality (Table 5).
Several report/poster/slides datasets with varying sizes and judge-based assessments (Table 5).

5.3 Does the evaluation coverage support the paper’s claims?

The survey convincingly supports its mapping and taxonomy claims by:
Providing aligned figures and tables that connect modules to representative work (Figure 2),
Explicitly enumerating evaluation benchmarks and metrics for multiple DR scenarios (Section 5; Tables 4–5).
However, because it is not itself an empirical comparison study, it does not directly validate “which design wins” across the taxonomy with controlled experiments. Instead, it provides the structure needed for others to do so.

5.4 Ablations, failure cases, robustness checks

The paper does not include ablation studies (it is a survey).
It does discuss failure modes and robustness issues conceptually, e.g.:
Noise sensitivity from irrelevant retrieval (Section 3.2.3),
Over-retrieval/under-retrieval and insufficient stepwise reward guidance (Section 6.1),
Instability in multi-turn RL (Section 6.3),
Bias and inefficiency in LLM-as-judge evaluation (Section 6.4.3).

6. Limitations and Trade-offs

Survey scope vs. empirical grounding
The paper provides structure and coverage but does not produce new head-to-head experimental evidence comparing approaches under controlled settings. This limits the strength of any “best practice” prescriptions.
Compute/latency vs. quality trade-offs are central and unresolved
Sequential and tree-based planning can be more effective but costlier and more complex (Sections 3.1.2–3.1.3).
Filtering improves robustness but adds latency and risks discarding useful evidence (Section 3.2.3).
Retrieval timing remains a core unsolved challenge
The paper explicitly notes that systems can still retrieve repeatedly and remain wrong, and that over-retrieval and under-retrieval can occur when reward signals are too coarse (Section 6.1).
Memory evolution is still immature
The survey argues existing memory modules often behave as passive buffers and highlights open needs in proactive personalization, structured memory, and goal-driven memory optimization (Section 6.2).
Training instability for multi-turn RL
The paper emphasizes that PPO/GRPO can become unstable in multi-turn settings, with issues like reward drops and entropy collapse, and it summarizes emerging mitigations (Section 6.3).
Evaluation remains a bottleneck for long-form DR
Long-form evaluation is hard without a single gold answer; the paper flags gaps in logical evaluation, the novelty-vs-hallucination boundary, and judge bias/efficiency (Sections 6.4.1–6.4.3).

7. Implications and Future Directions

How this work changes the landscape
By formalizing DR into phases and modular components (Section 2.2; Section 3), the survey pushes the field toward treating “research agents” as systems with explicit control loops rather than monolithic prompting recipes.
Follow-up research enabled/suggested
Retrieval timing: develop finer-grained step-level reward signals and uncertainty estimation that decides not only when to retrieve but also whether to answer at all when evidence is insufficient (Section 6.1).
Memory evolution: move from passive logs to proactive user models; adopt structured representations (graphs/timelines) with dynamic update/forgetting policies and goal-driven RL formulations (Section 6.2).
Stable multi-turn RL: design cold-start methods that preserve exploration, and create denser/smoother reward functions particularly for GRPO-like approaches (Section 6.3.2).
Evaluation: develop logical evaluation across multiple granularities, and methods to separate creative novelty from hallucinated novelty; reduce LLM-as-judge bias and cost (Section 6.4).
Practical applications / downstream use cases
The paper explicitly points to applications spanning:
- Agentic information seeking and long-horizon QA (Phase I/II framing in Section 2.2),
- Market/competitive analysis, policy briefs, itinerary planning, long-form reporting (Section 2.2),
- AI-for-research tasks like idea generation, experiment execution, academic writing, and peer review (Section 5.3),
- Software engineering tasks (Section 5.4).
Repro/Integration Guidance (when to prefer what, based on the survey)
Prefer conventional RAG when:
- The task is short-span, truth is localized, workflow can be fixed, and you want predictable runtime (Table 1; Phase I description in Section 2.2).
Prefer DR-style agentic workflows when:
- The task is open-ended, requires iterative exploration, conflict resolution, memory over many steps, and coherent long-form synthesis with citations (Sections 2.1–2.3; Table 1; Figure 1).
Prefer workflow prompting/multi-agent orchestration when:
- You need a practical system without full RL training, and you can enforce budgets, logging, and delegation via prompting (Section 4.1).
Prefer SFT + RL when:
- You want a single agent that internalizes tool-use and planning behaviors rather than relying on a complex orchestrated pipeline (Sections 4.2–4.3), while recognizing the instability and evaluation risks highlighted in Sections 6.3–6.4.