Meta-Reinforcement Learning with Self-Reflection for Agentic Search¶

Pitch¶

MR-Search reframes agentic search from isolated single-episode RL tasks into a meta-learning problem where agents condition on past failed attempts and their self-reflections to improve in-context exploration. By replacing expensive external reward models with learned cross-episode self-correction, this approach achieves 9.2% to 19.3% relative improvements across eight benchmarks, enabling agents to progressively refine their search strategies at test time.

1. Executive Summary¶

This paper introduces MR-Search, an in-context meta-reinforcement learning framework that trains language model agents to perform multi-episode agentic search with explicit self-reflection. The core innovation is reformulating search from independent single-episode RL (where each attempt starts fresh with sparse outcome rewards) into a meta-RL paradigm where agents condition on past failed attempts and their reflections, enabling dense turn-level credit assignment and progressive in-context improvement. MR-Search achieves 9.2% to 19.3% relative improvement over the strong Search-R1 baseline across eight QA benchmarks using Qwen2.5-3B and Qwen2.5-7B models, demonstrating that structured cross-episode reflection can replace expensive external process reward models.

2. Context and Motivation¶

The Core Problem: Sparse Rewards in Agentic Search¶

The paper addresses a fundamental challenge in training language model agents for multi-step search tasks: outcome rewards are sparse and delayed, making credit assignment extremely difficult. When an agent searches through documents, makes multiple tool calls, and ultimately produces an incorrect answer, standard RL methods provide no signal about which intermediate steps were problematic. The agent only learns that the final trajectory was wrong, leading to inefficient exploration, convergence to local optima, and poor search dynamics.

This problem is particularly acute in agentic search (also called "deep research"), where LLMs must:
- Decide when to call search tools and what queries to issue
- Process retrieved documents across multiple turns
- Aggregate information from diverse sources
- Navigate complex multi-hop reasoning chains

A single early error—such as an imprecise search query—can cascade through subsequent reasoning, yet the agent receives no feedback until the final answer is evaluated. Traditional RL algorithms like PPO or GRPO, which operate on single independent episodes, struggle to learn effective multi-turn strategies under these conditions.

Why This Problem Matters¶

The significance extends beyond academic benchmarking. Agentic search represents a critical capability for building autonomous AI systems that can conduct research, answer complex questions, and interact with external knowledge sources. The authors note several concrete applications mentioned in related work: "deep research" systems (Du et al., 2025; Shao et al., 2025), information-seeking agents (Mialon et al., 2023; Jin et al., 2025a), and multi-step decision-making in complex tasks.

The sparse reward problem limits the scalability and effectiveness of these systems. Without better credit assignment, agents cannot learn fine-grained search strategies—they either stumble upon correct answers through brute-force exploration or plateau at mediocre performance. This bottleneck constrains the practical deployment of agentic systems in high-stakes domains like legal research, medical literature review, or competitive intelligence.

Prior Approaches and Their Limitations¶

The paper situates itself against three categories of prior work:

Standard RL-based search agents (Search-R1, ReSearch). Methods like Search-R1 (Jin et al., 2025a) and ReSearch (Chen et al., 2025) train LLMs end-to-end using PPO or GRPO under the ReAct paradigm (Yao et al., 2022). The agent generates thought-action-observation sequences and receives a final outcome reward based on answer correctness. The limitation is clear: "these methods rely solely on sparse outcome rewards, without providing precise credit assignment for effective exploration" (Section 1). The paper explicitly cites Feng et al. (2025) on how multi-turn interactions amplify small errors and obscure credit assignment.

Process reward models (PRMs) and step-level supervision. To address sparse rewards, works like PPRM (Anonymous, 2026) and StepResearch (Wang et al., 2025b) introduce intermediate rewards at each step, typically using external models to evaluate reasoning quality. However, the paper identifies critical weaknesses: "these approaches rely on external annotations, which are both costly and difficult to reuse when task requirements change. Moreover, model-based rewards inevitably lead to reward hacking and bias (Wang et al., 2025a) and incur additional computational overhead in the RL training" (Section 1).

In-context self-correction methods. Approaches like Reflexion (Shinn et al., 2023), Self-Refine (Madaan et al., 2023), and self-correction methods (Huang et al., 2023) enable LLMs to iteratively refine outputs via intrinsic feedback at inference time. However, these methods are primarily prompting-based and do not involve end-to-end training. The paper notes: "In contrast, our work offers a novel meta-RL perspective on agentic search with tool interaction to catalyze continuous self-reflection, enabling the model to more effectively explore better answers" (Section 2).

How This Paper Positions Itself¶

MR-Search draws inspiration from meta-reinforcement learning (Duan et al., 2016; Wang et al., 2017; Laskin et al., 2023), where agents learn to learn—in-context adaptation allows policies to improve based on experience from initial exploration episodes. The key insight is that LLMs already possess strong in-context learning capabilities, which can be harnessed for structured self-reflection without external reward models.

The paper makes an explicit distinction from prior meta-RL work: "Unlike traditional meta-RL approaches (Duan et al., 2016; Stadie et al., 2018; Laskin et al., 2023) in robotics and games, we focus on open-domain agentic search tasks with tool interactions and self-reflection, without any reward feedback from the environment during inference" (Section 1). The setting differs because there is no intermediate environment feedback during deployment—only final answer verification.

Critically, the paper positions MR-Search as a training framework that produces agents capable of test-time adaptation, rather than a test-time inference trick. The meta-RL training objective explicitly optimizes for cross-episode improvement, teaching the model to generate useful reflections and leverage them effectively. This distinguishes the work from methods like Search-R1 with post-hoc reflection prompting, where the underlying policy was never trained to use reflections productively.

3. Technical Approach¶

3.1 Reader Orientation¶

MR-Search is a training framework that teaches language model agents to conduct multi-episode search with explicit self-reflection, where each episode's reflection serves as context for subsequent attempts, enabling progressively informed exploration without external process supervision.

The system solves the sparse reward problem by structuring training as meta-episodes—sequences of search episodes with reflections—and using a turn-level advantage estimation that assigns dense credit to each episode based on its marginal contribution to final performance.

3.2 Big-Picture Architecture (Diagram in Words)¶

The system has four major components:

Base LLM Policy (π_θ) — A Qwen2.5-3B or 7B model that generates thought-action-observation sequences, interleaving reasoning with search tool calls. This serves as the agent being trained.
Search Environment (E) — A Wikipedia-based retrieval system using E5 embeddings, returning up to 3 documents per query. The environment provides observations in response to the agent's search actions but gives no intermediate feedback.
Reflection Generator — The same LLM policy, prompted after each episode to produce an explicit self-reflection that analyzes what went wrong and what additional information is needed. Reflections become part of the context for subsequent episodes.
Meta-RL Optimizer — A turn-level policy gradient algorithm that estimates advantages using Leave-One-Out (RLOO) estimation over groups of meta-episodes, then applies discounted cumulative advantages with PPO-style clipping to update the policy.

Information flows as follows: a question enters → the agent generates episode 1 (thoughts, tool calls, observations, answer) → the agent generates a reflection on episode 1 → this concatenates into context → the agent generates episode 2 conditioned on the reflection → and so on for N episodes → each episode's answer is verified against ground truth → rewards are computed per episode → the optimizer computes turn-level advantages and updates the policy.

3.3 Roadmap for the Deep Dive¶

First, the formal meta-episode structure (Section 3.1-3.2 of the paper), which defines how episodes, reflections, and meta-episodes are organized and why this structure enables dense supervision.
Second, the meta-RL objective function (Equation 6), which formalizes what quantity is being maximized and how it differs from standard RL objectives.
Third, the turn-level advantage estimation (Equations 7-8), which is the technical core enabling fine-grained credit assignment without a value function.
Fourth, the PPO-style optimization objective (Equation 9), which specifies how the policy is updated given the advantages.
Fifth, extensions and design variations—exploration-exploitation masking, step-level meta-RL, and context management—which demonstrate the flexibility of the framework.

3.4 Detailed Technical Breakdown¶

This is primarily a methods paper introducing a new training framework that bridges meta-RL and agentic search. The core idea is structuring agent training around meta-episodes with explicit reflection, enabling dense credit assignment without external reward models.

Background: The ReAct Paradigm for Agentic Search¶

Before introducing MR-Search, the paper reviews how standard RL-based search agents operate (Section 3.1). Given a question-answer dataset \(D\) and a search engine \(E\), the goal is to train an LLM agent \(\pi_\theta\) to answer questions by reasoning and interacting with the search engine.

Under the ReAct paradigm, the agent executes cycles of:

Thought (τ): Internal reasoning based on current context
Action (α): External tool call, typically a search query
Observation (x): Feedback from the search engine (retrieved documents)

An interaction trajectory over \(T\) rounds is denoted:

\[a = (\tau_0, \alpha_0, x_0, \tau_1, \alpha_1, x_1, \ldots, \tau_{T-1})\]

The final thought \(\tau_{T-1}\) contains the answer \(o\). The standard RL objective maximizes:

\[J(\pi_\theta) = \mathbb{E}_{a \sim \pi_\theta}[f_{\text{verifier}}(o, o^*)]\]

where \(o^*\) is the ground-truth answer and \(f_{\text{verifier}}\) is a rule-based or model-based correctness function. The critical limitation: this is a sparse reward, received only at trajectory end, providing no signal about intermediate step quality.

The Meta-Episode Structure¶

MR-Search reorganizes the training process around meta-episodes—sequences of episodes where each episode conditions on all previous ones (Section 3.2). This is the structural innovation that enables dense supervision.

Definitions:
- Episode: One complete interaction trajectory with a final answer, consisting of thoughts, tool calls, and observations
- Reflection: An explicit text segment generated after each episode, analyzing what information was missing or what went wrong
- Meta-episode: A sequence of \(N\) episodes with interleaved reflections

The process works as follows:

Episode 0: Given question \(x\), the agent generates an initial trajectory: \(a_0 \sim \pi_\theta(a_0 \mid x)\)
Reflection: After episode 0, the agent is prompted to generate a reflection: \(\text{REFLECT}(C)\), where \(C = a_0\)
Episode 1: The agent generates a new trajectory conditioned on episode 0 plus its reflection: \(a_1 \sim \pi_\theta(a_1 \mid a_0, \text{REFLECT}(a_0))\)
Continuation: This repeats for \(N\) episodes, with each episode conditioning on all prior episodes and reflections: \(a_n \sim \pi_\theta(a_n \mid a_{<n}, \text{REFLECT}(a_{<n}))\)

A meta-episode is thus:

\[y = (a_0, a_1, \ldots, a_{N-1})\]

Why this structure matters: Each episode can be evaluated independently against the ground-truth answer, yielding a reward \(r_n = f_{\text{verifier}}(o_n, o^*)\) where \(o_n\) is the answer from episode \(n\). This transforms a single sparse reward problem into \(N\) evaluations per meta-episode.

The Meta-RL Objective¶

The paper defines a meta-level objective that maximizes expected cumulative rewards across the meta-episode (Equation 6):

\[J_{\text{meta}}(\pi_\theta) = \mathbb{E}_{y \sim \pi_\theta}\left[\sum_{n=0}^{N-1} \gamma^n R(s_n, a_n)\right] = \mathbb{E}_{y \sim \pi_\theta}\left[\sum_{n=0}^{N-1} \gamma^n f_{\text{verifier}}(o_n, o^*)\right]\]

Key components:
- \(s_n = a_{<n}\): The accumulated meta-context up to episode \(n\) (all previous episodes and reflections)
- \(a_n\): The action (episode) taken at step \(n\)
- \(R(s_n, a_n)\): The reward for episode \(n\), computed as the verifier score for the answer
- \(\gamma \in (0, 1]\): Discount factor for future returns

Design choice: Unless otherwise specified, \(\gamma = 1\) (no discounting). The paper notes that the discount factor can be set to propagate credit backward through episodes, which is tested in ablations.
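As a small illustration of Equation 6, the return of one sampled meta-episode is the discounted sum of its per-episode verifier scores. The sketch below assumes binary rewards; the helper name `meta_return` and the example rewards are illustrative and do not come from the paper.

```python
from typing import Sequence


def meta_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Discounted sum of per-episode verifier scores for one meta-episode.

    rewards[n] plays the role of f_verifier(o_n, o*) for episode n = 0..N-1.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return sum((gamma ** n) * r for n, r in enumerate(rewards))


# Hypothetical meta-episode: wrong, wrong, correct.
print(meta_return([0.0, 0.0, 1.0]))             # 1.0 (gamma = 1, no discounting)
print(meta_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25 (late success is down-weighted)
```

With \(\gamma = 1\) every episode contributes equally, so a meta-episode that recovers early and keeps its correct answer scores higher than one that only succeeds on the last attempt.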

The crucial difference from standard RL: the policy \(\pi_\theta\) now operates over a sequence of episodes, not a single trajectory. The agent learns to adapt its search strategy based on past failures, generating reflections that inform subsequent exploration.

Turn-Level Advantage Estimation¶

To optimize the meta-RL objective, the paper introduces a turn-level advantage estimation that remains critic-free while providing dense per-episode credit assignment (Section 3.3).

Step 1: Group-based baseline subtraction (RLOO). For each question, the algorithm samples a group of \(G\) meta-episodes: \(\mathcal{G} = \{y_i\}_{i=1}^G\). The default is \(G = 5\). For each episode position \(n\) within each meta-episode \(i\), it computes a baseline-subtracted reward using REINFORCE Leave-One-Out (RLOO) estimation:

\[\tilde{r}_{i,n} = r(s_{i,n}, a_{i,n}) - \frac{1}{G-1}\sum_{j \neq i} r(s_{j,n}, a_{j,n})\]

This subtracts the mean reward of all other meta-episodes at the same episode position, giving a relative performance measure. RLOO provides an unbiased estimate of the advantage (unlike GRPO, which the paper notes is biased), following results from Ahmadian et al. (2024) and Bereket & Leskovec (2025).
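The leave-one-out baseline in Equation 7 can be sketched in a few lines of plain Python. The function name `rloo_rewards` and the example group are assumptions made for illustration; only the formula itself comes from the text above.

```python
from typing import List


def rloo_rewards(rewards: List[List[float]]) -> List[List[float]]:
    """Leave-one-out baseline subtraction over a group of meta-episodes.

    rewards[i][n] is the verifier score of episode n in meta-episode i.
    All G meta-episodes are assumed to have the same number of episodes.
    """
    group_size = len(rewards)
    if group_size < 2:
        raise ValueError("leave-one-out needs at least two meta-episodes")
    num_episodes = len(rewards[0])
    # Sum over the group at each episode position; the own term is removed below.
    column_sums = [sum(row[n] for row in rewards) for n in range(num_episodes)]
    return [
        [row[n] - (column_sums[n] - row[n]) / (group_size - 1)
         for n in range(num_episodes)]
        for row in rewards
    ]


# Hypothetical group of G = 3 meta-episodes with N = 2 episodes each.
group = [[0.0, 1.0],
         [0.0, 0.0],
         [1.0, 1.0]]
print(rloo_rewards(group))  # [[-0.5, 0.5], [-0.5, -1.0], [1.0, 0.5]]
```

Each column of the result sums to zero, which reflects that the baseline only ranks meta-episodes against one another at a given episode position.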

Step 2: Discounted cumulative advantage. The per-episode relative reward \(\tilde{r}_{i,n}\) captures immediate effects but ignores future returns. To incorporate long-horizon dependencies, the algorithm computes a discounted cumulative advantage:

\[A_{i,n} = \sum_{n'=n}^{N-1} \gamma^{n'-n} \tilde{r}_{i,n'}\]

This propagates rewards backward from later episodes to earlier ones. If episode 3 succeeds (high reward), episodes 0-2 receive credit through the cumulative sum, enabling earlier reflection turns to receive gradient signal even if their immediate episode failed.

Why this works for credit assignment: Consider a scenario where episode 0 produces an incorrect answer due to a poor initial search query, but the reflection correctly identifies missing information, leading to a correct answer in episode 1. The cumulative advantage \(A_{i,0}\) includes rewards from both episodes 0 and 1, so the reflection generation in episode 0 receives positive gradient signal from episode 1's success.
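The backward credit propagation described above can be sketched as a right-to-left running sum. The function name `cumulative_advantages` and the example values are illustrative assumptions; the recursion is a direct rewrite of the discounted sum in Equation 8.

```python
from typing import List


def cumulative_advantages(r_tilde: List[float], gamma: float = 1.0) -> List[float]:
    """A_n = sum over n' >= n of gamma^(n' - n) * r_tilde[n'], computed right to left."""
    advantages = [0.0] * len(r_tilde)
    running = 0.0
    for n in reversed(range(len(r_tilde))):
        running = r_tilde[n] + gamma * running
        advantages[n] = running
    return advantages


# Hypothetical baseline-subtracted rewards: episode 0 failed, episode 1 succeeded.
r_tilde = [-0.25, 0.75]
print(cumulative_advantages(r_tilde, gamma=1.0))  # [0.5, 0.75]
print(cumulative_advantages(r_tilde, gamma=0.0))  # [-0.25, 0.75]
```

With \(\gamma = 1\) the failed first episode (and the reflection attached to it) ends up with a positive advantage because the next episode succeeded. With \(\gamma = 0\) that credit disappears, and episode 0 is simply penalized.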

PPO-Style Policy Optimization¶

With turn-level advantages computed, the policy is updated using a clipped surrogate objective adapted from PPO (Equation 9):

\[\frac{1}{G}\sum_{i=1}^G \frac{1}{|y_i|}\sum_{n=1}^{|y_i|} \min\left(\frac{\pi(y_{i,n} \mid x, y_{i,<n}; \theta)}{\pi(y_{i,n} \mid x, y_{i,<n}; \theta_{\text{old}})} A_{i,n}, \text{clip}\left(\frac{\pi(y_{i,n} \mid x, y_{i,<n}; \theta)}{\pi(y_{i,n} \mid x, y_{i,<n}; \theta_{\text{old}})}, 1-\varepsilon, 1+\varepsilon\right) A_{i,n}\right)\]

Components:
- \(\pi(y_{i,n} \mid x, y_{i,<n}; \theta)\): Current policy probability of generating episode \(n\) given previous context
- \(\pi(y_{i,n} \mid x, y_{i,<n}; \theta_{\text{old}})\): Old policy probability (from before the update)
- \(\varepsilon\): Clipping ratio (standard PPO parameter)
- \(A_{i,n}\): The turn-level advantage

The objective averages over:
1. All \(G\) meta-episodes in the group
2. All \(|y_i|\) episodes within each meta-episode

Design choice: Each episode's advantage is broadcast to all tokens in that episode. Following Shao et al. (2024), all tokens in episode \(n\) receive the same advantage signal \(A_{i,n}\). This is a simplification—more granular token-level advantages could be computed, but the paper uses episode-level uniform assignment.

Loss masking: Tool output tokens (observations returned by the search engine) are masked out from the loss. This follows Jin et al. (2025a) and ensures the policy is not trained to predict environment outputs, only its own actions.
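A minimal sketch of how the clipped objective, the per-episode advantage broadcast, and the loss mask could fit together for a single meta-episode is shown below. It makes several assumptions that the text above does not settle: the average is taken over unmasked tokens, `eps = 0.2` is a placeholder rather than the paper's value, and the function and argument names are hypothetical.

```python
import math
from typing import List


def clipped_surrogate(
    logp_new: List[float],
    logp_old: List[float],
    episode_ids: List[int],
    loss_mask: List[int],
    advantages: List[float],
    eps: float = 0.2,  # assumed clipping range, not taken from the paper
) -> float:
    """Token-averaged clipped surrogate for one meta-episode.

    episode_ids[t] names the episode that token t belongs to, so the
    episode-level advantage advantages[episode_ids[t]] is broadcast to it.
    loss_mask[t] is 0 for tool-output tokens, which are skipped.
    """
    total, count = 0.0, 0
    for lp_new, lp_old, episode, keep in zip(logp_new, logp_old, episode_ids, loss_mask):
        if not keep:
            continue  # observation tokens carry no policy gradient
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        advantage = advantages[episode]
        total += min(ratio * advantage, clipped_ratio * advantage)
        count += 1
    return total / count if count else 0.0


# Four tokens: two in episode 0 (one is tool output), two in episode 1.
value = clipped_surrogate(
    logp_new=[-1.0, -1.0, -1.0, -1.0],
    logp_old=[-1.0, -1.0, -1.0, -1.0],
    episode_ids=[0, 0, 1, 1],
    loss_mask=[1, 0, 1, 1],
    advantages=[0.5, -1.0],
)
print(value)  # -0.5
```

In a real training loop the log-probabilities would be differentiable tensors and the quantity would be maximized; the plain-float version here only shows the bookkeeping.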

The Complete Algorithm¶

Algorithm 1 in the paper summarizes the training procedure:

1. Sample a question-answer pair \((x, o^*)\) from dataset \(D\)
2. For each of \(G\) meta-episodes:
   1. Initialize context \(C \leftarrow x\)
   2. For each of \(N\) episodes:
      - Sample episode \(a_{i,n} \sim \pi_\theta(\cdot \mid C)\)
      - Generate reflection and append: \(C \leftarrow C \oplus a_{i,n} \oplus \text{REFLECT}(C)\)
      - Compute reward \(r_{i,n} \leftarrow f(a_{i,n}, o^*)\)
3. Compute advantages \(A_{i,n}\) via Equations 7-8
4. Update policy \(\theta\) using the PPO-style objective (Equation 9)
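The rollout part of the procedure above can be sketched with stand-in callables. Everything named here (`rollout_meta_episode`, `sample_episode`, `reflect`, `verify`, and the toy functions) is hypothetical; the sketch also assumes that the reflection is generated from the context after the new episode has been appended, which is one reading of the update \(C \leftarrow C \oplus a_{i,n} \oplus \text{REFLECT}(C)\).

```python
from typing import Callable, List, Tuple


def rollout_meta_episode(
    question: str,
    gold: str,
    sample_episode: Callable[[str], Tuple[str, str]],  # context -> (episode text, answer)
    reflect: Callable[[str], str],                      # context -> reflection text
    verify: Callable[[str, str], float],                # (answer, gold) -> reward
    num_episodes: int,
) -> Tuple[str, List[float]]:
    """Roll out one meta-episode and return the final context and per-episode rewards."""
    context = question
    rewards: List[float] = []
    for _ in range(num_episodes):
        episode_text, answer = sample_episode(context)
        rewards.append(verify(answer, gold))
        # Assumption: the reflection sees the episode it is reflecting on.
        context = context + "\n" + episode_text
        context = context + "\n" + reflect(context)
    return context, rewards


# Toy stand-ins: the "policy" only answers correctly once a reflection is in context.
def toy_policy(context: str) -> Tuple[str, str]:
    answer = "Paris" if "Reflection:" in context else "Lyon"
    return f"Answer: {answer}", answer


def toy_reflect(context: str) -> str:
    return "Reflection: the last answer lacked support; search again."


def toy_verify(answer: str, gold: str) -> float:
    return 1.0 if answer == gold else 0.0


_, rewards = rollout_meta_episode(
    "Capital of France?", "Paris", toy_policy, toy_reflect, toy_verify, num_episodes=3
)
print(rewards)  # [0.0, 1.0, 1.0]
```

Repeating this rollout \(G\) times per question yields the reward table that the advantage computation consumes.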

Training hyperparameters (Appendix A.1.3):
- Optimizer: AdamW
- Learning rate: \(1 \times 10^{-6}\) (no warmup)
- Sampling temperature and top-p: both set to 1.0 during rollout
- Total training steps: 300
- Group size \(G\): 5
- Maximum tool calls per episode: 3 for NQ/HotpotQA, 5 for ASearcher
- Context length: 8K tokens for NQ/HotpotQA, 16K for ASearcher

Extension 1: Exploration-Exploitation Masking¶

The paper introduces an optional extension that explicitly separates exploration and exploitation episodes (Section 3.4). Episodes designated as "exploration" have their rewards masked to zero when the cumulative advantage is computed:

\[A_{i,n} = \sum_{n'=n}^{N-1} \gamma^{n'-n} \tilde{r}_{i,n'} m_{n'}\]

where \(m_{n'} \in \{0, 1\}\) indicates whether episode \(n'\) is an exploitation (1) or exploration (0) episode.

Intuition: Exploration episodes serve purely as contextual adaptation—they provide information for subsequent episodes but don't directly contribute to the gradient. This prevents the policy from optimizing for short-term rewards in exploration episodes, instead encouraging it to view them as investments in long-term performance.

Experimental setup (Section 4.4): In experiments, the first two episodes are designated exploration and the last two exploitation. Results show this helps particularly for ASearcher, which requires more complex multi-turn search.
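The masked variant differs from the unmasked cumulative advantage only in that each baseline-subtracted reward is multiplied by its mask before being summed. The sketch below uses the two-exploration, two-exploitation split described above; the function name `masked_advantages` and the reward values are made up for illustration.

```python
from typing import List


def masked_advantages(
    r_tilde: List[float], mask: List[int], gamma: float = 1.0
) -> List[float]:
    """Cumulative advantage where exploration episodes (mask 0) contribute no reward."""
    advantages = [0.0] * len(r_tilde)
    running = 0.0
    for n in reversed(range(len(r_tilde))):
        running = r_tilde[n] * mask[n] + gamma * running
        advantages[n] = running
    return advantages


r_tilde = [-0.5, 0.25, 0.5, 0.75]  # hypothetical baseline-subtracted rewards
print(masked_advantages(r_tilde, mask=[0, 0, 1, 1]))  # [1.25, 1.25, 1.25, 0.75]
print(masked_advantages(r_tilde, mask=[1, 1, 1, 1]))  # [1.0, 1.5, 1.25, 0.75]
```

With the mask applied, the exploration episodes are credited only with what the later exploitation episodes achieve, so a poor immediate result in episode 0 no longer lowers its advantage.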

Extension 2: Step-Level Meta-RL¶

The meta-episode structure can be applied at finer granularity (Section 3.4). Instead of treating a complete trajectory as one episode, each tool-interaction step can be treated as a "micro-episode."

Implementation: During training, the model is prompted to produce an intermediate answer after each tool call. Each intermediate answer is evaluated against the ground truth, yielding step-level rewards. This transforms long trajectories into sequences of micro-episodes, each with its own reflection and evaluation.

Benefits: Dense supervision at the step level promotes informative intermediate reasoning and reduces redundant exploration. In Table 3 this extension beats Search-R1 on six of the eight benchmarks, though it trails full-episode MR-Search on all of them.

Extension 3: Context Management¶

A practical concern: context length grows linearly with the number of reflection steps. The paper tests a mitigation strategy—retaining only the immediately preceding episode as context rather than the full history (Section 4.4).

Results in Table 3 show "MR-Search Short Context" achieves 48.1% on NQ (vs. 50.2% for full context) but remains substantially above Search-R1's 45.9%. This suggests most of the benefit comes from the most recent reflection, reducing computational overhead with moderate performance cost.
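The difference between the two context strategies can be sketched as a small prompt-assembly helper. The function name `build_context`, the newline-joined layout, and the `(episode, reflection)` pairing are assumptions for illustration; the paper's actual prompt format is not reproduced here.

```python
from typing import List, Tuple


def build_context(
    question: str,
    history: List[Tuple[str, str]],
    short_context: bool = False,
) -> str:
    """Assemble the prompt for the next episode.

    history holds (episode_text, reflection_text) pairs, oldest first.
    With short_context=True only the most recent pair is kept.
    """
    kept = history[-1:] if short_context else history
    parts = [question]
    for episode_text, reflection_text in kept:
        parts.extend([episode_text, reflection_text])
    return "\n".join(parts)


history = [("episode 0", "reflection 0"), ("episode 1", "reflection 1")]
print(build_context("question", history))
print(build_context("question", history, short_context=True))
```

The short variant keeps the prompt length roughly constant per episode instead of growing with the number of reflection turns.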

Inference-Time Behavior¶

At test time, the trained agent:
1. Generates an initial episode (thoughts, tool calls, answer)
2. Reflects on the episode
3. Generates a new episode conditioned on the reflection
4. Repeats for up to \(N\) episodes (the paper uses 3 by default, testing extrapolation beyond)

Answer selection: The paper reports "the last valid prediction" (Section 4.1), suggesting the final episode's answer is used. However, the reflection mechanism allows the agent to preserve correct answers—"when the answer is already accurate...the model can preserve it and avoid unnecessary revisions" (Section 4.3).
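One way to read "the last valid prediction" is to scan the per-episode answers from the end and take the first one that parsed successfully. The sketch below treats a missing or empty answer as invalid, which is an assumption; the paper's exact validity criterion is not given here, and the function name is hypothetical.

```python
from typing import List, Optional


def last_valid_prediction(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most recent non-empty answer, or None if no episode produced one."""
    for answer in reversed(answers):
        if answer is not None and answer.strip():
            return answer.strip()
    return None


# Hypothetical run: the third episode failed to emit a parseable answer.
print(last_valid_prediction(["Lyon", "Paris", None]))  # Paris
```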

Why This Approach Over Alternatives?¶

Vs. standard RL (Search-R1): MR-Search provides dense per-episode rewards instead of a single sparse outcome reward, enabling credit assignment without an external process reward model.

Vs. process reward models (PPRM, StepResearch): MR-Search requires no external annotations or reward models, avoiding reward hacking and computational overhead. The rewards come from ground-truth answer verification, which is available for training.

Vs. prompting-based self-reflection (Reflexion): MR-Search trains the model end-to-end to generate useful reflections and leverage them effectively. The reflection capability is learned, not just elicited from a frozen model.

Vs. PPO with value function: The RLOO-based advantage estimation is critic-free, eliminating the need for a separate value model while still providing per-turn credit assignment.

4. Key Insights and Innovations¶

Innovation 1: Meta-RL Formulation for Agentic Search with Self-Reflection¶

The paper's primary contribution is formalizing in-context meta-reinforcement learning as a practical framework for training agentic search systems. This bridges two previously disconnected literatures: meta-RL (which originated in robotics and games) and LLM-based search agents.

What makes this genuinely novel is the specific instantiation for the open-domain search setting. Prior meta-RL work assumed access to environment feedback during inference—the agent could take actions, observe rewards, and adapt. This paper shows that the same principle works when the only feedback is final answer correctness, by structuring training around explicit textual reflections that serve as the "memory" mechanism.

The reflection prompts are standardized (see Appendix A.1.3 for the exact template), and the meta-episode structure ensures reflections are trained to be informative. This is not simply applying meta-RL to a new domain—it requires reconceptualizing the episode structure and developing appropriate credit assignment mechanisms.

Innovation 2: Turn-Level Advantage Estimation Without a Value Function¶

The second key contribution is a critic-free algorithm for turn-level credit assignment in multi-episode settings. The combination of RLOO baseline normalization (Equation 7) and discounted cumulative advantages (Equation 8) provides dense per-episode learning signals without training a separate value function.

This matters because value functions for LLMs are expensive—they require additional model parameters, careful training to avoid overfitting, and introduce bias if mis-calibrated. The RLOO approach uses the group structure naturally: by comparing each meta-episode to others in the same batch, the algorithm estimates relative quality without an explicit value model.

The theoretical grounding (unbiased advantage estimation) is cited from prior work, but the application to the meta-episode setting with backward credit propagation is novel. The discounted cumulative sum lets later successes provide gradient signal to earlier reflections, which is critical for learning effective reflection generation.

Innovation 3: Training Reflections End-to-End Rather Than Prompting¶

A conceptual contribution that may be underappreciated: the paper demonstrates that reflection capability can be trained rather than elicited. Prior self-reflection methods (Reflexion, Self-Refine) used frozen models with carefully designed prompts. This paper shows that optimizing for cross-episode improvement via meta-RL produces models that naturally learn to generate useful reflections.

The training process shapes reflection behavior implicitly: the model is never explicitly taught what a "good" reflection looks like. Instead, the cumulative advantage credits reflection tokens with the success of subsequent episodes, reinforcing reflections that lead to improvement. This is validated qualitatively in the case studies (Tables 4-7), where reflections correctly identify missing information and guide subsequent searches.

Innovation 4: Cross-Episode Structure Enables Test-Time Scaling Without Retraining¶

The paper shows that meta-RL training produces agents capable of test-time extrapolation beyond the training horizon. Agents trained with \(N = 3\) episodes can effectively use additional reflection turns at inference time, with performance improving beyond the training configuration (Figures 3 and 5).

This contrasts sharply with Search-R1 with post-hoc reflection prompting, which shows only marginal gains from additional turns because the underlying policy was never trained to leverage reflections. The paper writes: "the single-turn method Search-R1 with our reflection mechanism yields only marginal gains when additional reflection turns are allowed, since its training objective is optimized for a single turn" (Section 4.3).

This finding has practical implications: organizations can train agents with moderate compute budgets (few episodes) and reap benefits at test time by scaling reflection depth, without retraining.

5. Experimental Analysis¶

Evaluation Methodology¶

Datasets. The paper evaluates on eight benchmarks across three categories:

Single-hop QA (3 datasets):
- NQ (Natural Questions): 79,168 training, 3,610 test samples (Kwiatkowski et al., 2019)
- TriviaQA: 11,313 test samples (Joshi et al., 2017)
- PopQA: Subset of 14,267 entity-centric QA pairs (Mallen et al., 2022)

Multi-hop QA (4 datasets):
- HotpotQA: 90,447 training, 7,405 test samples (Yang et al., 2018)
- 2WikiMultiHopQA: 7,405 test pairs (Ho et al., 2020)
- Musique: 2,417 test pairs (Trivedi et al., 2022)
- Bamboogle: 125 test pairs (Press et al., 2022)

Complex synthetic dataset:
- ASearcher: Synthetic multi-turn dataset with 14k samples, split 90%/10% for training/evaluation (Gao et al., 2025)

Training data. All finetuning methods use merged NQ + HotpotQA training sets, following Jin et al. (2025b). ASearcher has a separate training split.

Base models. Qwen2.5-3B-Base and Qwen2.5-7B-Base (Yang et al., 2024).

Retrieval setup. 2018 Wikipedia dump indexed with E5 embeddings (Wang et al., 2022a), retrieving top-3 documents per query. This is fixed across all methods.

Evaluation metric. Exact Match (EM) score—predicted answer must exactly match a ground-truth answer after normalization. The paper reports "average EM for the last valid prediction" per question.
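For readers unfamiliar with the metric, the sketch below shows a common SQuAD-style Exact Match implementation: lowercase, strip punctuation, drop English articles, and collapse whitespace before comparing. Whether the paper applies exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from typing import Iterable

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    normalized = normalize_answer(prediction)
    return float(any(normalized == normalize_answer(gold) for gold in gold_answers))


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # 1.0
print(exact_match("Paris, France", ["Paris"]))             # 0.0
```

The second call shows why EM is strict: a correct but more verbose answer scores zero unless it appears among the accepted gold answers.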

Inference protocol. Greedy decoding (temperature = 0, top-p = 1.0) at test time, with 3 reflection turns by default.

Baselines. Three categories:

Inference without finetuning:
- Direct Inference (base model, no retrieval)
- Search-o1 (Li et al., 2025b): Agentic RAG with reason-in-document module
Finetuning without step-level supervision:
- ReSearch (Chen et al., 2025): RL-based framework with explicit search actions
- Search-R1 (Jin et al., 2025a): RL training with GRPO under ReAct paradigm
Finetuning with step-level supervision:
- PPRM (Anonymous, 2026): Principle process reward model with GRPO
- StepResearch (Wang et al., 2025b): Step-wise PPO with intermediate rewards

Compute resources. 8× NVIDIA H100 (80GB) GPUs for RL training, plus 2× H100 for retriever serving.

Main Quantitative Results¶

Table 1: Benchmark Performance¶

Qwen2.5-3B-Base results:

Method	NQ	TriviaQA	PopQA	HotpotQA	2wiki	Musique	Bamboogle	Avg
Direct Inference	10.6	28.8	10.8	14.9	24.4	2.0	2.4	13.4
Search-o1	23.8	47.2	26.2	22.1	21.8	5.4	32.0	25.5
ReSearch	42.7	59.7	43.0	30.5	27.2	7.4	11.5	30.4
Search-R1	46.2	62.2	45.6	32.6	31.0	7.7	17.6	34.7
PPRM	42.3	56.5	41.1	35.3	34.0	12.7	28.0	35.7
StepResearch	44.6	61.5	45.6	37.3	33.8	10.5	32.5	38.0
MR-Search	47.7	63.5	46.0	41.9	40.1	16.5	34.4	41.4

Relative improvement over Search-R1: MR-Search achieves 41.4% average vs. 34.7% for Search-R1, a 19.3% relative improvement (calculated as \((41.4 - 34.7) / 34.7\)).

Qwen2.5-7B-Base results:

Method	NQ	TriviaQA	PopQA	HotpotQA	2wiki	Musique	Bamboogle	Avg
Direct Inference	13.4	40.8	14.0	18.3	25.0	3.1	12.0	18.1
Search-o1	15.1	44.3	13.1	18.7	17.6	5.8	29.6	20.6
ReSearch	36.6	60.5	39.1	37.8	38.6	16.6	37.6	38.1
Search-R1	45.9	63.2	44.9	43.9	38.7	18.1	40.0	42.1
PPRM	45.8	61.0	43.7	38.6	35.5	14.7	35.5	39.3
StepResearch	47.3	63.6	43.1	43.9	41.8	20.5	43.5	43.4
MR-Search	50.2	66.6	47.2	46.8	43.6	22.1	45.2	46.0

Relative improvement over Search-R1: MR-Search achieves 46.0% average vs. 42.1% for Search-R1, a 9.2% relative improvement.
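The relative-improvement figures can be rechecked directly from the rounded averages in the two tables. The helper name is illustrative; the inputs are the Avg columns above.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


print(relative_improvement(41.4, 34.7))  # Qwen2.5-3B averages
print(relative_improvement(46.0, 42.1))  # Qwen2.5-7B averages
```

From the rounded averages the 3B figure is about 19.3% and the 7B figure is about 9.26%. The reported 9.2% is therefore consistent with truncation or with averages computed before rounding, although that reading is an inference rather than something the paper states.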

Key observations:
- The improvement is larger for the smaller 3B model (19.3%) than the 7B model (9.2%), suggesting meta-RL is particularly beneficial when base capability is limited
- MR-Search outperforms methods requiring external process supervision (PPRM, StepResearch), demonstrating the effectiveness of free process rewards from reflections
- Multi-hop QA benchmarks (HotpotQA, 2wiki, Musique) show substantial gains, consistent with the intuition that complex multi-turn search benefits most from cross-episode learning

Figure 4: ASearcher Results¶

The ASearcher dataset requires longer-horizon, multi-turn search. MR-Search achieves 10.2% relative improvement in EM and 9.5% improvement in F1 over Search-R1 (Section 4.2).

Training dynamics (Figure 4) show:
- MR-Search achieves higher training reward consistently throughout training
- MR-Search calls the search engine more frequently, dynamically adjusting search behavior based on task complexity
- Test accuracy converges faster and to higher values for MR-Search

Ablation Studies¶

Table 2: Algorithm Components¶

Discount factor: Setting \(\gamma = 0\) (removing future credit assignment) degrades performance from 46.0% to 43.8% average on 7B. The paper notes: "removing the discount factor substantially degrades performance and causes the training process to converge to poor local optima" (Section 4.2).

Training algorithm comparison:

Method	Avg
ReSearch	38.1
Search-R1	42.1
MR-Search w. PPO	42.0
MR-Search w. MT-GRPO	44.3
MR-Search	46.0

Both PPO and MT-GRPO underperform the proposed RLOO-based advantage estimation. Notably, MR-Search with PPO actually underperforms Search-R1 on single-hop NQ and PopQA, indicating the optimization algorithm matters significantly.

Table 3: Design Variants¶

Method	NQ	TriviaQA	PopQA	HotpotQA	2wiki	Musique	Bamboogle	ASearcher
Search-R1	45.9	63.2	44.9	43.9	38.7	18.1	40.0	36.9
MR-Search	50.2	66.6	47.2	46.8	43.6	22.1	45.2	41.3
MR-Search Exploration	48.3	65.1	46.4	44.7	39.4	21.8	44.0	43.2
MR-Search Step Level	48.6	64.6	45.7	42.3	41.4	16.3	41.6	38.4
MR-Search Short Context	48.1	65.9	45.2	44.6	41.0	19.3	47.2	40.5

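Observations:
- Exploration-exploitation masking helps ASearcher (43.2% vs. 41.3%), consistent with the need for more exploration in complex tasks, but it trails standard MR-Search on the seven QA benchmarks
- Step-level meta-RL beats Search-R1 on six of the eight benchmarks (it falls below on HotpotQA and Musique) and underperforms full-episode MR-Search on all of them
- Short context management lowers performance relative to full-context MR-Search on every benchmark except Bamboogle (47.2% vs. 45.2%) but remains above Search-R1 throughout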

Test-Time Scaling Analysis¶

Figures 3 and 5 demonstrate that MR-Search scales effectively with additional reflection turns at inference time:

MR-Search performance increases steeply from turn 1 to turn 4
Search-R1 with post-hoc reflection (Search-R1-S) shows only marginal gains
Search-R1 with parallel sampling (Search-R1-P) plateaus

The paper writes: "MR-Search achieves significantly higher performance, exhibiting the steep improvement curve. These results suggest that multi-turn reflection with MR-Search enhances the model's ability to iteratively refine and optimize its search across turns and enables effective extrapolation" (Section 4.3).
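The difference between sequential reflection and parallel sampling is mainly one of information flow, which a toy sketch can make visible. Everything below is hypothetical: `toy_episode` and `toy_reflect` are deterministic stand-ins for a trained policy, and the answers "2015" and "2002" are borrowed from the Table 4 case study purely as placeholders. The sketch does not reproduce the paper's inference code.

```python
def toy_episode(context):
    """Hypothetical policy stub: improves only once a reflection is in context."""
    has_reflection = any(item.startswith("Reflection:") for item in context)
    answer = "2002" if has_reflection else "2015"
    return f"Trajectory: searched, answered {answer}", answer


def toy_reflect(context):
    """Hypothetical reflection stub; a real model would condition on `context`."""
    return "Reflection: the previous answer may lack evidence; search again."


def sequential_reflection(question, run_episode, reflect, num_turns):
    """Each turn conditions on all earlier trajectories and reflections."""
    context = [f"Question: {question}"]
    answers = []
    for _ in range(num_turns):
        trajectory, answer = run_episode(context)
        answers.append(answer)
        context += [trajectory, reflect(context + [trajectory])]
    return answers


def parallel_sampling(question, run_episode, num_samples):
    """Each sample sees only the question, so nothing carries over."""
    return [run_episode([f"Question: {question}"])[1] for _ in range(num_samples)]


print(sequential_reflection("q", toy_episode, toy_reflect, 3))  # ['2015', '2002', '2002']
print(parallel_sampling("q", toy_episode, 3))                   # ['2015', '2015', '2015']
```

The stub is deterministic, so the output says nothing about real accuracy; it only shows that later sequential turns can use earlier failures, while parallel samples cannot.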

Assessment: Do the Experiments Support the Claims?¶

Claim 1: MR-Search substantially outperforms baselines. Strongly supported. Across eight benchmarks and two model sizes, MR-Search achieves relative improvements over Search-R1 of 9.2% (7B) and 19.3% (3B). The gains are particularly pronounced on multi-hop QA, where the meta-episode structure addresses the credit assignment challenge.

Claim 2: The approach works without external process supervision. Supported. MR-Search outperforms PPRM and StepResearch, which require step-level annotations or external reward models. This validates the core premise that cross-episode reflection provides effective process supervision for free.

Claim 3: Trained reflections enable test-time scaling. Supported qualitatively in Figures 3 and 5, though exact numerical improvements from additional turns are not reported in table form. The performance curves show clear extrapolation beyond the 3-turn training configuration.

Potential limitations:
- The paper does not report statistical significance or variance estimates for the main results, making it difficult to assess reproducibility
- Baselines like PPRM and StepResearch use different training configurations; fairness of comparison depends on implementation details not fully specified
- The 3B model shows larger gains than 7B, but it's unclear whether this pattern would hold for larger models
- All experiments use Wikipedia search; generalization to other tools (web search, code execution) is not tested

6. Limitations and Trade-offs¶

Assumption: Ground-Truth Answers Available During Training¶

The entire MR-Search framework relies on access to ground-truth answers \(o^*\) during training to compute verifier rewards \(f_{\text{verifier}}(o_n, o^*)\) for each episode. This is standard for QA benchmarks but creates a fundamental dependency: the method cannot be directly applied to tasks without clean correctness signals. The paper explicitly acknowledges this limitation (Section 5):

"We do not evaluate our method on long-form benchmarks, where responses are substantially longer. Verification in such settings is inherently challenging, and how to reliably assess progress and final correctness for long-form generation remains an open research question."

This constraint limits applicability to domains like open-ended generation, creative writing, or complex planning tasks where correctness is subjective or multi-dimensional. Extending MR-Search to such settings would require either surrogate reward functions (introducing potential reward hacking) or human feedback during training (which is costly).
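To illustrate what a "clean correctness signal" looks like in practice, here is a minimal sketch of a binary verifier in the spirit of \(f_{\text{verifier}}(o_n, o^*)\). It assumes exact match after the answer normalization commonly used in QA evaluation (lowercasing, stripping punctuation and articles). The paper's actual verifier is not specified in this section, and the function names are hypothetical.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(predicted, gold_answers):
    """Return 1.0 if the prediction matches any gold answer, else 0.0."""
    pred = normalize_answer(predicted)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0


print(exact_match_reward("The Eiffel Tower.", ["Eiffel Tower"]))  # 1.0
print(exact_match_reward("2015", ["2002"]))                       # 0.0
```

A check of this kind only works when a short reference answer exists, which is why long-form or subjective tasks fall outside its reach.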

Scalability: Context Length Grows Linearly with Reflection Turns¶

The paper notes that "MR-Search conditions on trajectories and explicit reflections from all previous episodes, causing the context length to increase linearly with the number of reflection steps \(N\)" (Section 3.2). This has practical consequences:

With \(N = 3\) episodes, each containing multiple tool calls and observations, context can easily exceed 8K tokens for complex queries
The ASearcher experiments require 16K context windows, increasing memory and compute costs
Extrapolation to more turns (as shown in Figures 3 and 5) compounds this problem

The paper tests a mitigation—retaining only the immediately preceding episode—but this degrades performance from 50.2% to 48.1% on NQ (Table 3). There is a fundamental tradeoff: richer reflection history improves performance but increases computational cost. More sophisticated context management (summarization, selective retention) is left for future work.
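The linear growth, and the effect of keeping only the preceding episode, can be sketched with simple arithmetic. All token counts below are hypothetical placeholders chosen for illustration; the paper does not report per-episode token statistics in this section, and `context_before_episode` is a name introduced here.

```python
def context_before_episode(n, episode_tokens, reflection_tokens,
                           prompt_tokens, keep_last_only=False):
    """Estimated context size when episode n (1-indexed) starts.

    Full history keeps all n - 1 earlier episodes with their reflections;
    keep_last_only retains just the immediately preceding one.
    """
    prior = n - 1
    if keep_last_only:
        prior = min(prior, 1)
    return prompt_tokens + prior * (episode_tokens + reflection_tokens)


# Hypothetical sizes: 200-token prompt, 2500-token episode, 300-token reflection.
full = [context_before_episode(n, 2500, 300, 200) for n in range(1, 5)]
short = [context_before_episode(n, 2500, 300, 200, keep_last_only=True)
         for n in range(1, 5)]
print(full)   # [200, 3000, 5800, 8600]
print(short)  # [200, 3000, 3000, 3000]
```

Under these assumed sizes the full history keeps growing with every extra turn, while the short-context variant stays flat after the second episode at the cost of discarding earlier reflections.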

Single Tool Domain: Wikipedia Search Only¶

All experiments use a fixed Wikipedia search engine with E5-based retrieval. The paper acknowledges this limitation (Section 5):

"Our current study focuses on agentic search with a fixed Wikipedia search tool. Extending MR-Search to environments involving multiple heterogeneous tools, such as combined web search and web browsing, we leave...to future work."

This narrow scope raises questions about generalization:
- Would reflections trained for Wikipedia search transfer to web search with different result structures?
- Can the framework handle tool selection decisions (e.g., choosing between search and code execution)?
- How does performance scale with retrieval quality: would weaker retrievers benefit more or less from meta-RL?

The paper provides no evidence on these questions, leaving transfer to other tool environments as an open problem.

Training Compute and Sample Efficiency Not Analyzed¶

The paper reports strong final performance but does not analyze training efficiency. Key questions are unaddressed:
- How many gradient steps are required for MR-Search to converge versus Search-R1?
- What is the per-step computational cost given the meta-episode structure?
- Is the 300-step training budget comparable across methods?

Figure 4 shows MR-Search achieves higher training reward, but the x-axis is "training step", and it is unclear whether each step represents equivalent computational cost. Each meta-episode chains \(N\) episodes per question, and a group of \(G\) meta-episodes is sampled per question, so the number of generated trajectories may be substantially higher than for single-episode methods.

The paper mentions compute resources (8× H100 for training, 2× H100 for retrieval) but does not report training duration, FLOPs, or comparison to baseline compute requirements.

Reflection Quality Not Directly Evaluated¶

The case studies (Tables 4-7) qualitatively show reflections identifying missing information, but no quantitative evaluation of reflection quality is provided. Important questions remain:

What fraction of reflections correctly diagnose the failure mode?
Do reflections that identify missing information actually lead to improved subsequent episodes?
Can we predict when reflections will be helpful versus misleading?

The ablation removing future credit assignment (\(\gamma = 0\)) degrades performance, suggesting backward credit propagation is important. However, without direct reflection evaluation, it is unclear whether this is because early reflections are generally high-quality, or because the training process learns to weight reflections appropriately regardless of quality.

The Discount Factor Remains an Unexplored Hyperparameter¶

The paper sets \(\gamma = 1\) by default and tests \(\gamma = 0\) in ablation (Table 2), showing degradation. However, no analysis of intermediate values (e.g., \(\gamma \in \{0.5, 0.7, 0.9\}\)) is provided. The discount factor controls how strongly later success backpropagates to earlier episodes, which is central to the credit assignment mechanism.

The ablation result suggests credit propagation matters, but the optimal discounting schedule is not characterized. Different task types (single-hop vs. multi-hop, short vs. long horizons) might benefit from different \(\gamma\) values, but this remains unexplored.

Model Scale Limited to 7B Parameters¶

All experiments use Qwen2.5-3B and 7B models. The paper notes in limitations:

"It would be also particularly interesting to scale MR-Search to large agentic RL training runs and further study the scaling properties of Meta-RL with frontier base models."

This is a significant gap. The relative improvement is larger for 3B (19.3%) than 7B (9.2%), suggesting diminishing returns as base capability increases. Whether this trend continues for larger models—or whether frontier models exhibit different scaling behavior—is unknown.

Additionally, the comparison to methods like PPRM and StepResearch (which may use different base models or training configurations) could be affected by model-specific properties not captured in this evaluation.

No Analysis of Failure Modes¶

The paper shows positive results but does not systematically analyze when or why MR-Search fails. Questions include:

What types of questions benefit most/least from reflection?
Does MR-Search introduce new failure modes (e.g., over-confident incorrect reflections)?
How does performance degrade with increasing question difficulty?

Figure 4 shows MR-Search calls the search engine more frequently, suggesting it learns to be more thorough—but this could also indicate inefficient exploration in some cases. Without failure analysis, it is difficult to assess robustness or identify where further improvements are needed.

The Correct-to-Incorrect Reversion Problem (Implicit)¶

While the paper does not discuss this explicitly, the case studies hint at a potential issue. In Table 4, the model initially answers "2015" (incorrect), answers "2002" (correct) after reflection, and then answers "2002" again after a further reflection. This case shows a correct answer being preserved, but a single example cannot rule out the opposite transition, where a further reflection turns a correct answer into an incorrect one.

The paper notes that "when the answer is already accurate...the model can preserve it and avoid unnecessary revisions" (Section 4.3), but provides no quantitative analysis of how often correct answers are preserved versus corrupted across reflection steps. This is a critical practical concern for deployment.
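The missing analysis is straightforward to specify. The sketch below tallies turn-to-turn correctness transitions and derives a reversion rate, i.e. the fraction of correct answers that become incorrect at the next reflection turn. The input data is invented for illustration and the function names are hypothetical; no such numbers are reported in the paper.

```python
from collections import Counter


def transition_counts(correctness_by_turn):
    """Count (before, after) correctness pairs across consecutive turns.

    correctness_by_turn holds one list of booleans per question,
    with one entry per reflection turn.
    """
    counts = Counter()
    for turns in correctness_by_turn:
        for before, after in zip(turns, turns[1:]):
            counts[(before, after)] += 1
    return counts


def reversion_rate(counts):
    """Fraction of correct answers that turn incorrect at the next turn."""
    was_correct = counts[(True, True)] + counts[(True, False)]
    return counts[(True, False)] / was_correct if was_correct else 0.0


# Invented example: three questions, three turns each.
data = [
    [False, True, True],   # fixed, then preserved
    [True, True, False],   # preserved, then reverted
    [False, False, True],  # fixed on the last turn
]
counts = transition_counts(data)
print(counts[(False, True)], counts[(True, False)])  # 2 1
print(round(reversion_rate(counts), 3))              # 0.333
```

Reporting this rate alongside accuracy per turn would show whether additional reflection turns are safe to enable by default.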

7. Implications and Future Directions¶

How This Work Changes the Landscape¶

MR-Search establishes a new paradigm for training agentic systems: meta-RL with explicit self-reflection as a scalable alternative to external process supervision. This has several implications for the field:

1. Process rewards can be "free." The paper demonstrates that dense supervision does not require expensive human annotations or trained process reward models. By structuring training around meta-episodes with explicit reflections, ground-truth answer verification at the episode level provides sufficient signal for learning fine-grained search strategies. This could significantly reduce the cost of training agentic systems while avoiding reward hacking issues associated with model-based rewards.

2. Reflection capability is trainable, not just elicitable. Prior self-reflection methods (Reflexion, Self-Refine) treated reflection as a capability of frozen models, accessed through careful prompting. MR-Search shows that end-to-end training shapes reflection behavior—the model learns what information to include in reflections through gradient signals from subsequent episode success. This opens a new research direction: optimizing reflection generation for downstream task performance.

3. Test-time scaling can be "baked in" through training. The extrapolation results (Figures 3, 5) show that agents trained with meta-RL can effectively use additional reflection turns at inference time without retraining. This is practically significant: organizations can train agents with moderate episode budgets and scale compute at deployment time, achieving returns that prompting-based reflection cannot match.

4. Cross-episode structure addresses the credit assignment problem without architectural changes. The sparse reward problem has been approached through process reward models, hierarchical RL, and other complex interventions. MR-Search shows that reorganizing the training objective—from independent episodes to meta-episodes—can achieve dense credit assignment with minimal architectural overhead.

Follow-Up Research This Work Enables¶

1. Extending to multi-tool environments. The paper explicitly identifies this direction. A natural extension is training agents to reflect not only on answer quality but also on tool selection and query formulation across heterogeneous tools (web search, code execution, database queries). The meta-RL framework should generalize—each tool interaction can be treated as an episode—but the reflection prompts and reward structures would require redesign.

2. Dense reward functions beyond answer correctness. The current framework relies on answer verification, but other dense signals could enhance training:
- Retrieval relevance scores (e.g., BM25 or embedding similarity to relevant passages)
- Query efficiency metrics (minimizing tool calls while maintaining accuracy)
- Intermediate reasoning quality assessments from small verifier models

The key constraint is avoiding reward hacking: the paper's ground-truth verification is hard to game because it compares against a fixed reference answer rather than a learned scorer, whereas auxiliary signals would introduce new risks.

3. Adaptive reflection depth. The number of episodes is fixed at \(N = 3\) during training, but adaptive policies could decide dynamically when to stop reflecting based on confidence or progress estimates. This connects to the exploration-exploitation masking variant tested in Table 3, but with learned rather than fixed boundaries.

4. Reflection distillation for smaller models. A practical research direction: train a large model with MR-Search, then distill its reflection behavior into a smaller model. The smaller model might not achieve the same performance but could retain much of the reflection capability at lower inference cost.

5. Combining with other RL advances. Table 2 already compares the default RLOO-based advantage estimation against PPO and MT-GRPO variants, but the turn-level credit assignment itself is not tied to a specific optimizer. Combining it with other GRPO variants or with KTO- or DPO-style approaches could yield further improvements.

6. Theoretical analysis of credit assignment. The empirical results show the discount factor and cumulative advantage computation matter, but theoretical grounding is limited. Formal analysis of when and why backward credit propagation helps—in different task structures, discount settings, and horizon lengths—would provide guidance for hyperparameter selection.

Practical Applications and Downstream Use Cases¶

1. Deep research assistants. The paper's motivation connects directly to "deep research" systems (Du et al., 2025; Shao et al., 2025) that conduct multi-step information gathering. MR-Search provides a training recipe for agents that can iteratively refine their understanding through self-directed exploration, reducing reliance on external process supervision.

2. Customer support and FAQ systems. Multi-hop questions are common in customer support scenarios where answers require synthesizing information across multiple documents or knowledge base entries. Agents trained with MR-Search could more effectively navigate complex queries, improving first-contact resolution rates.

3. Legal and medical literature review. Professional domains often require synthesizing information across multiple sources. While the current work uses Wikipedia, the framework could extend to domain-specific corpora (legal databases, medical literature) where multi-hop reasoning is essential.

4. Educational question-answering. Complex educational questions often require connecting concepts across topics. MR-Search's ability to refine searches based on reflection could improve educational QA systems.

Reproduction and Integration Guidance¶

When to prefer MR-Search over alternatives:

- Prefer MR-Search when:
    - Ground-truth answers are available for training (standard QA setup)
    - Multi-turn search is required (multi-hop questions, complex reasoning)
    - External process supervision is unavailable or too costly
    - Test-time scaling is desired (agents should improve with additional compute at deployment)
- Prefer external process supervision (PPRM, StepResearch) when:
    - Task-specific process knowledge is critical and can be reliably annotated
    - The reward hacking concerns are mitigated (e.g., robust verifier design)
    - Training efficiency is paramount (meta-episodes increase sample complexity)
- Prefer prompting-based reflection (Reflexion, Self-Refine) when:
    - No training compute is available (frozen models only)
    - The task domain is well-matched to existing reflection prompts
    - Quick deployment is prioritized over optimal performance

Practical implementation considerations:

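Compute budget: Meta-episode training requires \(N \times G\) episode rollouts per question. For \(N = 3\) and \(G = 5\), this is 15 rollouts per question: 3× a single-episode method that samples the same group size \(G\), and 15× a method that samples one trajectory per question. Later episodes also carry longer contexts, so plan GPU memory accordingly.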
Context window: Use at least 8K context for standard QA tasks; 16K+ for complex multi-turn reasoning. Consider the short-context variant if memory is constrained.
Discount factor: Start with \(\gamma = 1\) and ablate. The results suggest full credit propagation is beneficial, but specific tasks may differ.
Reflection prompt design: The paper provides exact templates (Appendix A.1.3). The prompt instructs the model to "reflect on your current answer" and "search for additional external information." Customization for specific domains may improve performance.
Group size: The default \(G = 5\) provides a balance between variance reduction and computational cost. Larger groups yield more stable advantage estimates but require more samples.
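How much extra rollout cost meta-episodes add depends on what the single-episode baseline samples. The arithmetic below makes both readings explicit, assuming that each episode counts as one rollout and ignoring the longer contexts of later episodes. The function name and the batch size of 64 questions are hypothetical.

```python
def rollouts_per_question(num_episodes, group_size):
    """Episode rollouts generated for one question (one rollout per episode)."""
    return num_episodes * group_size


N, G = 3, 5            # values used in the considerations above
QUESTIONS = 64         # hypothetical batch size, for illustration only

meta = rollouts_per_question(N, G)          # meta-episode training
grouped = rollouts_per_question(1, G)       # single episode, same group size
single = rollouts_per_question(1, 1)        # single episode, one sample

print(meta, meta / grouped, meta / single)  # 15 3.0 15.0
print(meta * QUESTIONS)                     # 960
```

Against a baseline that also samples \(G\) trajectories per question the overhead is a factor of \(N\); the factor of \(N \times G\) applies only relative to one trajectory per question.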