In-Context Reinforcement Learning for Tool Use in Large Language Models¶

Pitch¶

Training LLMs to use external tools typically requires expensive supervised fine-tuning data, but this paper introduces ICRL—a framework that eliminates SFT entirely by embedding few-shot demonstrations directly into RL rollout prompts. By gradually reducing these in-context examples during training, the model transitions from imitation to autonomous tool use, achieving state-of-the-art results with zero labeled supervision.

1. Executive Summary¶

This paper introduces In-Context Reinforcement Learning (ICRL), a framework that trains large language models to use external tools (search engines, code interpreters) without requiring supervised fine-tuning (SFT) or labeled tool-use trajectories. The core innovation is a curriculum that embeds few-shot demonstrations directly into RL rollout prompts and gradually removes them, enabling the model to transition from imitation to autonomous tool use. ICRL achieves state-of-the-art results across five QA benchmarks, with reported gains over the strongest baselines (Search-R1, ZeroSearch, ParallelSearch) of +8.94 average EM on Qwen2.5-3B and +7.34 average EM on Qwen2.5-7B. On math reasoning it remains competitive with SFT+RL methods like ReTool (ahead on AIME2025, behind on AIME2024) despite using zero labeled supervision.

2. Context and Motivation¶

The Core Problem: Training LLMs to Use Tools Is Data-Hungry¶

Large language models excel at reasoning within their pretrained knowledge, but they fail when tasks require information beyond their training cutoff or capabilities beyond text generation. A compelling solution is to augment LLMs with external tools—search engines for up-to-date information, Python interpreters for computation. However, teaching models when and how to invoke these tools effectively remains a significant challenge.

The dominant training paradigm follows a cold-start pipeline: first apply supervised fine-tuning (SFT) on labeled tool-use trajectories, then refine with reinforcement learning (RL). This approach has a critical weakness: it requires substantial amounts of high-quality labeled data showing correct tool invocations. Annotating such data is expensive, and synthetic generation often produces noisy or unrealistic trajectories.

Why This Problem Matters¶

The paper identifies several practical motivations:

Deployment cost: Collecting annotated tool-use trajectories for every new domain or tool type is prohibitively expensive, limiting the scalability of tool-augmented models.
Exploration failure: Directly applying RL from scratch (without SFT) fails because the model has no initial tool-use capability and cannot discover successful tool-calling behaviors through random exploration.
Frozen prompts vs. learned policies: Existing few-shot prompting approaches can guide models to use tools at inference time, but they require costly prompt engineering, consume context window space, and don't adapt to the specific task distribution.

Prior Approaches and Their Limitations¶

The paper surveys three categories of prior work:

Direct prompting methods (Chain-of-Thought, direct inference) require no training but perform poorly on complex reasoning tasks. As shown in Table 3, direct prompting on Qwen2.5-7B achieves only 19.84 average EM on difficult QA datasets, compared to ICRL's 49.12.

Retrieval-based methods (RAG, IRCoT, Search-o1) augment generation with retrieved documents but use fixed retrieval policies that don't adapt to query complexity or learn optimal search strategies.

Fine-tuning methods form the strongest baselines:

- SFT requires labeled tool traces but still underperforms (17.64 average EM on Qwen2.5-3B).
- RL-only methods like DeepSeek-R1 apply RL directly but struggle with exploration on tool-use tasks.
- SFT+RL hybrids like Search-R1, ZeroSearch, O2-Searcher, and ReTool achieve strong results but require costly SFT phases with thousands of annotated examples.

A particularly instructive comparison is O2-Searcher, which uses cold-start SFT before RL and achieves 37.26 average EM on Qwen2.5-3B—ICRL achieves 40.16 with no SFT at all.

How This Paper Positions Itself¶

ICRL's central insight is that few-shot prompting and reinforcement learning can be unified: instead of treating few-shot demonstrations as a static inference-time technique, ICRL embeds them directly into the RL training process. The demonstrations provide initial guidance during exploration (solving the cold-start problem), and the curriculum gradually removes them, forcing the model to internalize tool-use strategies.

The paper explicitly positions ICRL as:

- A data-efficient alternative to SFT+RL pipelines, eliminating the need for labeled trajectories.
- A unified framework that works across diverse tool types (web search, code execution).
- A curriculum learning approach that transitions from imitation (few-shot) to autonomy (zero-shot).

3. Technical Approach¶

3.1 Reader Orientation¶

ICRL is a training framework for teaching language models to call external tools through reinforcement learning, where the key mechanism is embedding few-shot tool-use demonstrations into the RL rollout prompts and gradually phasing them out over training stages.

3.2 Big-Picture Architecture (Diagram in Words)¶

The system has four major components:

LLM Policy — the model being trained (Qwen2.5 variants) that generates reasoning steps, tool calls, and final answers.
Tool Environment — external systems (Serper API for web search, Python interpreter for math) that process tool invocations and return observations.
Rollout Template — the prompt construction that includes task instructions plus N few-shot demonstrations; N decreases over training stages.
GRPO Optimizer — the reinforcement learning algorithm (Group Relative Policy Optimization) that updates the policy based on rewards, with loss masking to exclude non-model-generated tokens.

Information flows as follows: a question enters the system → the rollout template prepends N demonstrations → the LLM generates a trajectory (reasoning, tool calls, answer) → tools execute and return observations → a composite reward (accuracy + format) is computed → GRPO updates the policy using only model-generated tokens.

3.3 Roadmap for the Deep Dive¶

First, the formal problem formulation of tool-augmented LLMs as a Markov Decision Process.
Second, the RL objective and GRPO algorithm with loss masking.
Third, the ICRL training process—how demonstrations are embedded and the curriculum of reduction.
Fourth, the reward design combining accuracy and format correctness.
Fifth, implementation details including hyperparameters and training infrastructure.

3.4 Detailed, Sentence-Based Technical Breakdown¶

This is primarily a methods paper whose core idea is that few-shot demonstrations can serve as soft supervision during RL exploration, eliminating the need for a supervised fine-tuning (cold-start) stage.

Formal Problem Formulation: Tool Use as a Conditional Generation Process¶

The paper models tool-augmented reasoning as a conditional generation problem where the model's response depends not only on the query and previous tokens, but also on the history of tool interactions. Given a query \(q\) and an external tool \(T\), the model generates a response \(y = (y_1, y_2, \ldots, y_{|y|})\) according to:

\[\pi_\theta(y | q, T) = \prod_{t=1}^{|y|} \pi_\theta(y_t | y_{<t}, q, H_t)\]

where \(\pi_\theta\) is the policy model parameterized by \(\theta\), and \(H_t\) denotes the sequence of previous actions and corresponding tool observations up to step \(t\).

The interaction follows a structured format using XML-style tags: reasoning steps appear in <thought>...</thought>, search queries in <search>...</search>, tool responses in <information>...</information>, and final answers in <answer>...</answer>. This structure enables parsing and validation.

The tool itself is modeled as a response function. For a search engine, given a query \(q'\), the tool returns an observation \(o = T(q')\) such as the top-k retrieved documents. This observation is appended to the model's context for subsequent generation.
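The interaction described above can be sketched as a simple loop. This is a minimal illustration rather than the paper's implementation: `run_rollout`, `fake_generate`, and `fake_tool` are hypothetical names, the model and the search engine are replaced by stubs, and the `<thought>` tag name is taken from the tag list above.

```python
import re
from typing import Callable, List

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_rollout(
    generate: Callable[[str], str],
    tool: Callable[[str], List[str]],
    prompt: str,
    max_turns: int = 6,
) -> str:
    """Alternate between model generation and tool calls until an answer appears."""
    context = prompt
    for _ in range(max_turns):
        segment = generate(context)          # model-generated text for this turn
        context += segment
        if ANSWER_RE.search(segment):
            break                            # final answer produced, stop
        query = SEARCH_RE.search(segment)
        if query is None:
            break                            # neither a tool call nor an answer
        docs = tool(query.group(1).strip())  # observation o = T(q')
        context += "<information>" + "\n".join(docs) + "</information>"
    return context


# Stub policy and stub tool, standing in for the LLM and the search engine.
def fake_generate(context: str) -> str:
    if "<information>" not in context:
        return "<thought>I need to look this up.</thought><search>capital of France</search>"
    return "<thought>The document says Paris.</thought><answer>Paris</answer>"


def fake_tool(query: str) -> List[str]:
    return [f"Doc about {query}: Paris is the capital of France."]


trace = run_rollout(fake_generate, fake_tool, "Question: What is the capital of France?\n")
```

The resulting `trace` interleaves model-generated segments with one `<information>` span appended by the environment, which is the history \(H_t\) the policy conditions on.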

RL Objective Function with Tool Use¶

The paper formulates the learning objective as maximizing expected reward with a KL-divergence regularization to prevent the policy from deviating too far from a reference model:

\[\max_{\pi_\theta} \mathbb{E}_{q \sim \mathcal{D}, y \sim \pi_\theta(\cdot|q,T)} [r_\phi(q, y)] - \beta \cdot D_{KL}[\pi_\theta(y | q, T) \| \pi_{\text{ref}}(y | q, T)]\]

where \(\pi_\theta\) is the policy LLM, \(\pi_{\text{ref}}\) is the reference LLM (typically the initial policy before training), \(r_\phi\) is the reward function, and \(D_{KL}\) is the KL-divergence measuring distributional distance.

Loss Masking: Excluding Tool Outputs from Optimization¶

A critical design choice is loss masking—excluding tool-returned content from the gradient computation. During tool-augmented rollouts, retrieved documents (inside <information> tags) are inserted into the context but were not generated by the model. Optimizing over these tokens would be meaningless since the model cannot control what the tool returns.

The solution: only tokens generated by the language model contribute to the policy gradient, while retrieved spans are masked out. This ensures learning focuses on the model's decisions—when to call tools, what queries to issue, how to reason—rather than on external content.
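A minimal sketch of such a mask follows, assuming each rollout is stored as a list of segments tagged by who produced them. The names `build_loss_mask` and `masked_mean` and the token ids are illustrative, not taken from the paper or from any training framework.

```python
from typing import List, Tuple


def build_loss_mask(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Flatten (source, token_ids) segments into token ids plus a 0/1 loss mask.

    source is "model" for policy-generated tokens and "tool" for tokens
    inserted by the environment (for example the <information> span).
    """
    token_ids: List[int] = []
    mask: List[int] = []
    for source, ids in segments:
        token_ids.extend(ids)
        mask.extend([1 if source == "model" else 0] * len(ids))
    return token_ids, mask


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the loss over model-generated tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


segments = [
    ("model", [11, 12, 13]),  # reasoning and the <search> query
    ("tool", [90, 91]),       # retrieved documents
    ("model", [14, 15]),      # follow-up reasoning and the <answer>
]
ids, mask = build_loss_mask(segments)
loss = masked_mean([1.0, 1.0, 1.0, 5.0, 5.0, 2.0, 2.0], mask)
```

In this toy case the two tool tokens carry a large per-token loss, but they do not affect the average because their mask entries are zero.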

GRPO (Group Relative Policy Optimization) for Tool Use¶

The paper adopts GRPO, an algorithm introduced in DeepSeekMath, to optimize the policy. The key idea is sampling a group of trajectories per query and computing advantages relative to the group mean, which reduces variance without requiring a separate value function model.

For each query \(q\), the algorithm samples \(N\) trajectories \(\{\tau_1, \ldots, \tau_N\}\) from the old policy \(\pi_{\theta_{\text{old}}}\). The loss is:

\[\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{\tau_i \sim \pi_{\theta_{\text{old}}}(q), q \sim \mathcal{D}_{\text{RL}}} \left[ \frac{1}{\sum_{i=1}^{N} |\tau_i|} \sum_{i=1}^{N} \sum_{t=1}^{|\tau_i|} \text{CLIP}(r_{i,t}(\theta), A_i, \epsilon) - \beta \cdot D_{KL}[\pi_\theta \| \pi_{\text{ref}}] \right]\]

where the advantage \(A_i\) for each trajectory is computed by normalizing rewards across the group:

\[A_i = \frac{R(\tau_i) - \text{mean}(\{R(\tau_j)\}_{j=1}^{N})}{\text{std}(\{R(\tau_j)\}_{j=1}^{N})}\]

and \(r_{i,t}(\theta) = \pi_\theta(\tau_{i,t} | q, \tau_{i,<t}) / \pi_{\theta_{\text{old}}}(\tau_{i,t} | q, \tau_{i,<t})\) is the importance weight.

This group-relative advantage computation means high-reward trajectories get pushed up and low-reward trajectories get pushed down, all without training a separate critic network.
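The advantage computation can be illustrated with a few lines of Python. This sketch assumes the population standard deviation and adds a small `eps` to avoid division by zero when all rewards in a group are equal; both choices are assumptions, since the formula above does not specify them.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each trajectory reward by the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts for one query, two of which earned the full reward.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advs = group_advantages(rewards)
```

The two successful rollouts receive a positive advantage and the six failures a negative one, and the advantages sum to approximately zero. When every rollout in a group earns the same reward, all advantages are zero and that query contributes no gradient signal.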

The ICRL Training Process: In-Context Demonstrations as Soft Supervision¶

The core innovation is embedding few-shot demonstrations directly into RL rollouts and gradually removing them. At training start, the rollout prompt includes N demonstration examples:

\[\pi_\theta(y | P_N, q, T) = \prod_{t=1}^{|y|} \pi_\theta(y_t | P_N, y_{<t}, q, H_t)\]

where \(P_N\) represents the few-shot prompt consisting of N demonstration examples.

Rollout template structure (Table 1): The template includes:

1. Task instructions: "Solve the following problem step by step. You must conduct reasoning inside <thought>...</thought> every time you get new information..."
2. N demonstration examples showing complete reasoning trajectories with tool calls.
3. The actual problem to solve.

Curriculum of reduction: After training for several steps with N demonstrations, training pauses and the prompt is reduced to \(P_{N-1}\) (one fewer example). This process repeats until zero demonstrations remain. The final policy operates in a zero-shot setting:

\[\pi_\theta(y | P_0, q, T) = \prod_{t=1}^{|y|} \pi_\theta(y_t | y_{<t}, q, H_t)\]

Why this works: The few-shot demonstrations serve as "soft supervision" during early exploration. Instead of randomly flailing to discover tool-use behaviors, the model sees examples of correct tool calls and can imitate them. As training progresses, the model internalizes the patterns and no longer needs the scaffolding. The gradual reduction forms a curriculum from imitation (few-shot) to autonomy (zero-shot).
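The prompt construction and the reduction schedule can be sketched as follows. The function names, the abbreviated instruction string, and the stage lengths are placeholders chosen for illustration; the schedule shown (3, then 2, then 0 demonstrations) is one of the two schedules the paper tests.

```python
from typing import List, Tuple

INSTRUCTIONS = "Solve the following problem step by step."  # abbreviated placeholder


def build_prompt(question: str, demos: List[str], n_shots: int) -> str:
    """Assemble P_N: instructions, the first n_shots demonstrations, then the question."""
    parts = [INSTRUCTIONS]
    parts.extend(demos[:n_shots])
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


def shots_for_step(step: int, schedule: List[Tuple[int, int]]) -> int:
    """Return the number of demonstrations to use at a given training step.

    schedule is a list of (n_shots, num_steps) stages, applied in order.
    """
    boundary = 0
    for n_shots, num_steps in schedule:
        boundary += num_steps
        if step < boundary:
            return n_shots
    return schedule[-1][0]  # stay at the final stage once the schedule ends


schedule = [(3, 100), (2, 100), (0, 100)]  # stage lengths are placeholders
demos = ["Demo A", "Demo B", "Demo C"]
prompt_early = build_prompt("Who wrote Hamlet?", demos, shots_for_step(10, schedule))
prompt_late = build_prompt("Who wrote Hamlet?", demos, shots_for_step(250, schedule))
```

Only the prompt changes between stages; the policy, reward, and optimizer stay the same, which is what lets the demonstrations act as removable scaffolding.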

Demonstration construction: The paper uses only three questions randomly sampled from the web, with GPT-5.2 generating the demonstration trajectories formatted according to the rollout template. This minimal data requirement (just 3 examples) contrasts sharply with SFT methods requiring thousands of labeled trajectories.

Reward Design: Combining Accuracy and Format Correctness¶

The reward function has two components:

\[r_\phi(q, y) = \alpha \cdot \text{reward}_{\text{acc}} + (1 - \alpha) \cdot \text{reward}_{\text{format}}\]

where \(\alpha = 0.8\) in experiments, heavily weighting accuracy while still providing signal for format learning.

Accuracy reward: Binary exact match (EM) between predicted and ground-truth answer—1 if correct, 0 otherwise.

Format reward: Penalizes violations of the expected XML structure:

\[\text{reward}_{\text{format}} = 1.0 - \sum_{v \in V} \text{penalty}(v)\]

where \(V\) is the set of format violations. Penalties (Table 2):

- No <answer> tag: 0.5 (most severe, since a structured answer is required)
- Unbalanced <answer> tags: 0.2
- No <thought> tag: 0.15
- Unbalanced <thought> tags: 0.1
- No <search> usage: 0.1
- Empty answer content: 0.2

This design ensures the model learns both to use tools correctly and to format outputs properly.
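A sketch of the composite reward under these penalties is given below. The penalty values and \(\alpha = 0.8\) come from the text; how each violation is detected (simple tag counting), the `<thought>` tag name, and the plain string comparison used for accuracy are assumptions of this example.

```python
import re
from typing import List

PENALTIES = {
    "no_answer_tag": 0.5,
    "unbalanced_answer": 0.2,
    "no_thought_tag": 0.15,
    "unbalanced_thought": 0.1,
    "no_search": 0.1,
    "empty_answer": 0.2,
}
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_violations(text: str) -> List[str]:
    """Detect violations by counting tags (a simplifying assumption)."""
    found = []
    a_open, a_close = text.count("<answer>"), text.count("</answer>")
    t_open, t_close = text.count("<thought>"), text.count("</thought>")
    if a_open == 0 and a_close == 0:
        found.append("no_answer_tag")
    elif a_open != a_close:
        found.append("unbalanced_answer")
    else:
        match = ANSWER.search(text)
        if match is None or not match.group(1).strip():
            found.append("empty_answer")
    if t_open == 0 and t_close == 0:
        found.append("no_thought_tag")
    elif t_open != t_close:
        found.append("unbalanced_thought")
    if "<search>" not in text:
        found.append("no_search")
    return found


def composite_reward(text: str, gold: str, alpha: float = 0.8) -> float:
    """alpha * accuracy + (1 - alpha) * format, as in the formula above."""
    match = ANSWER.search(text)
    pred = match.group(1).strip() if match else ""
    acc = 1.0 if pred == gold else 0.0
    fmt = 1.0 - sum(PENALTIES[v] for v in format_violations(text))
    return alpha * acc + (1.0 - alpha) * fmt


good = "<thought>t</thought><search>q</search><information>d</information><answer>Paris</answer>"
r_good = composite_reward(good, "Paris")
r_bare = composite_reward("Paris", "Paris")  # right string, but no tags at all
```

A well-formed correct trajectory earns the full reward of 1.0, while the bare string "Paris" earns only 0.05: no answer can be extracted, so accuracy is zero, and the format score falls to 0.25.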

Implementation Details and Hyperparameters¶

Models: Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, Qwen3-8B—all loaded in bfloat16 precision.

Training infrastructure:

- Framework: Volcano Engine Reinforcement Learning (VeRL)
- GPUs: 4 × NVIDIA A100 (80GB each)
- Precision: bfloat16
- Training approach: Fully Sharded Data Parallel (FSDP) with gradient checkpointing

Hyperparameters:

- Learning rate: \(1 \times 10^{-6}\) (note: the paper states "1e-6" but Section 3.1 says "1e-5"; the context suggests \(10^{-6}\) for RL)
- Batch size: 64
- KL penalty coefficient \(\beta\): 0.001
- Number of rollout trajectories per query: 8
- Sampling temperature: 1.0
- Maximum prompt length: 5000 tokens
- Maximum response length: 2048 tokens
- Maximum search turns per query: 6
- Tool retrieval: BM25 returning top-3 documents per query

Training dataset: Natural Questions (NQ) loaded via FlashRAG, with gold-standard answers. NQ is excluded from evaluation to prevent data leakage.

Evaluation datasets: TriviaQA, HotpotQA, 2Wiki, Musique, Bamboogle—sampled up to 500 questions each.

Curriculum Design: Three-Stage vs. Four-Stage¶

The paper experiments with two curriculum schedules:

Three-stage (3→2→0): Start with 3 demonstrations, reduce to 2, then to 0. The 1-shot stage is skipped, so the prompt jumps directly from 2 demonstrations to none.

Four-stage (3→2→1→0): A more gradual reduction including an intermediate stage with 1 demonstration.

Ablation result (Figure 2): The three-stage curriculum performs substantially better. On Qwen2.5-7B, the three-stage model achieves 75.4 EM on TriviaQA versus only 20.8 with four-stage. The four-stage curriculum leads to faster decisions (80% of queries finish within 2 turns) but lower quality; the paper's explanation is that reducing rollout length too early encourages premature stopping and weakens multi-turn reasoning.

Why GRPO Over Other RL Algorithms?¶

The paper doesn't explicitly compare algorithms, but GRPO offers several advantages for tool-use learning:

No value function needed: Traditional actor-critic methods require training a separate critic network to estimate value functions. GRPO's group-relative advantage computation replaces this, reducing memory and training complexity.
Lower variance: By comparing trajectories within a group (sampled from the same query), the advantage estimation normalizes for query difficulty—high-reward trajectories are those that perform well relative to other samples on the same question.
Stable training: The clipping mechanism and KL penalty prevent catastrophic policy drift, which is particularly important when the model is learning complex structured behaviors like tool calling.
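The clipping mechanism can be sketched for a single token. This assumes that \(\text{CLIP}(r, A, \epsilon)\) in the GRPO loss denotes the standard PPO-style clipped surrogate, \(\min(rA, \text{clip}(r, 1-\epsilon, 1+\epsilon)A)\); the value \(\epsilon = 0.2\) is a common default used here as a placeholder, not a number reported in the text.

```python
def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token (assumed form of CLIP)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the token probability is capped at 1 + eps.
up_capped = clipped_term(1.5, 1.0)
# Negative advantage: the penalty is kept at least as large as the clipped value.
down_floor = clipped_term(0.5, -1.0)
# Ratio inside the trust region: the term is just ratio * advantage.
inside = clipped_term(1.0, 2.0)
```

Once the importance ratio leaves the interval \([1-\epsilon, 1+\epsilon]\) in the direction the advantage favors, the term stops growing, so a single batch cannot push the policy arbitrarily far from the one that generated the rollouts.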

4. Key Insights and Innovations¶

Innovation 1: Eliminating SFT Through In-Context RL¶

The most significant contribution is demonstrating that a supervised fine-tuning stage is unnecessary for tool-use learning. Prior work assumed a cold-start problem: RL from scratch fails because the model cannot discover tool-use behaviors through random exploration, so SFT must first bootstrap the capability. ICRL overturns this assumption by using few-shot demonstrations during RL rollouts, providing the guidance that SFT would offer but without requiring labeled datasets.

This is a fundamental innovation, not an incremental improvement. It changes the training paradigm from "SFT then RL" to "RL with in-context scaffolding." The data efficiency gain is substantial: ICRL uses only 3 GPT-generated demonstrations versus methods like O2-Searcher or ReTool that require thousands of annotated trajectories.

Innovation 2: Curriculum Learning from Imitation to Autonomy¶

The gradual reduction of demonstrations forms a curriculum that transitions from imitation to independent problem-solving. This is analogous to training wheels on a bicycle: initially, demonstrations prevent the model from "falling" into unproductive exploration; eventually, they're removed so the model learns to balance independently.

The ablation result—that the three-stage curriculum (skipping the 1-shot stage) outperforms four-stage—is itself an interesting finding. It suggests that intermediate scaffolding levels can be counterproductive, perhaps because the model over-relies on minimal demonstrations rather than internalizing the behavior. The jump from 2-shot to 0-shot forces more aggressive adaptation.

Innovation 3: Loss Masking for Tool-Augmented RL¶

While loss masking is conceptually straightforward, its application to tool-use RL addresses a real technical problem. The paper correctly identifies that optimizing over tool-returned tokens would be meaningless (the model cannot control retrieval results) and potentially harmful (it might teach the model to predict retrieved content rather than reason about it).

This design choice enables the RL loop to operate correctly: the model generates tool calls → tool returns observations → model generates reasoning based on observations → only the model-generated tokens contribute to learning. This is a clean separation of policy optimization from environment responses.

Innovation 4: Cross-Domain Generalization Without Domain-Specific SFT¶

The experiments show ICRL transferring to both web search (QA benchmarks) and code execution (AIME math problems) using the same framework. On math reasoning, ICRL exceeds ReTool on AIME2025 (+2.4 points) while slightly underperforming on AIME2024 (-2.9 points), despite ReTool using extensive SFT.

This demonstrates that the framework isn't task-specific—it provides a general recipe for teaching tool use. The few-shot demonstrations need only illustrate the tool-call format and reasoning structure; the RL process handles domain-specific adaptation.

5. Experimental Analysis¶

Evaluation Methodology¶

Backbone models: Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, and Qwen3-8B. The instruct variants are chosen for their instruction-following capabilities, which accelerate RL convergence.

Training data: Natural Questions (NQ) via FlashRAG, providing real Google Search queries with Wikipedia passages containing answers. Three questions are randomly sampled for GPT-5.2 to generate few-shot demonstrations.

Evaluation benchmarks:

- TriviaQA (Joshi et al., 2017): Single-hop factoid QA
- HotpotQA (Yang et al., 2018): Multi-hop reasoning requiring 2-hop inference
- 2Wiki (Ho et al., 2020): Multi-hop QA with compositional reasoning
- Musique (Trivedi et al., 2022): Complex multi-hop questions
- Bamboogle (Press et al., 2023): Compositional questions designed to test tool use

NQ is excluded from evaluation to avoid data leakage. Each benchmark is sampled to at most 500 questions.

Metric: Exact Match (EM) accuracy—whether the predicted answer string exactly matches the ground truth.
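A minimal sketch of an EM scorer follows. The normalization step (lowercasing, stripping punctuation and English articles) is the convention commonly used in open-domain QA evaluation and is an assumption here; the text above only says the strings must match exactly.

```python
import re
import string
from typing import List

PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> int:
    """Return 1 if the prediction matches any gold answer after normalization."""
    pred = normalize(prediction)
    return int(any(pred == normalize(gold) for gold in gold_answers))


hit = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
miss = exact_match("Paris, France", ["Paris"])
```

EM gives no partial credit: "Paris, France" scores 0 against the gold answer "Paris", which is one reason the metric can understate the quality of verbose but correct answers.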

Baselines:

- Direct prompting: Zero-shot inference
- CoT: Chain-of-Thought prompting
- IRCoT: Interleaving Retrieval with Chain-of-Thought
- Search-o1: Agentic search with reasoning models
- RAG: Standard retrieval-augmented generation
- SFT: Supervised fine-tuning on tool-use trajectories
- R1-instruct / R1-base: RL methods without explicit search training
- Rejection Sampling: Sampling and selecting high-quality trajectories
- Search-R1: RL training for search-augmented reasoning
- ZeroSearch: RL method incentivizing search capability
- ParallelSearch: RL for parallel query decomposition
- O2-Searcher: Cold-start SFT followed by RL
- ReTool: SFT+RL for code-execution tool use (math domain)

Main Quantitative Results¶

Table 3: ICRL vs. all baselines (Qwen2.5-3B and Qwen2.5-7B)

On Qwen2.5-3B:

| Method | TriviaQA | HotpotQA | 2Wiki | Musique | Bamboogle | Average |
|--------|----------|----------|-------|---------|-----------|---------|
| Direct | 28.8 | 14.9 | 24.4 | 2.0 | 2.4 | 14.50 |
| CoT | 3.2 | 2.1 | 2.1 | 0.2 | 0.0 | 1.52 |
| RAG | 54.4 | 25.5 | 22.6 | 4.7 | 8.0 | 23.04 |
| Search-R1 | 54.5 | 32.4 | 31.9 | 10.3 | 26.4 | 31.10 |
| ZeroSearch | 57.4 | 27.4 | 30.0 | 9.8 | 11.1 | 27.14 |
| ICRL | 72.6 | 35.4 | 39.2 | 20.0 | 33.6 | 40.16 |

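The paper reports +8.94 average EM over the best baseline; measured against Search-R1 (31.10), the strongest baseline in the rows above, the gap is 9.06. Per-dataset gains over Search-R1 are +18.1 on TriviaQA, +3.0 on HotpotQA, +7.3 on 2Wiki, +9.7 on Musique, and +7.2 on Bamboogle, so the improvement is large on single-hop TriviaQA as well as on the multi-hop datasets.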

On Qwen2.5-7B:

| Method | TriviaQA | HotpotQA | 2Wiki | Musique | Bamboogle | Average |
|--------|----------|----------|-------|---------|-----------|---------|
| Direct | 40.8 | 18.3 | 25.0 | 3.1 | 12.0 | 19.84 |
| RAG | 58.5 | 29.9 | 23.5 | 5.8 | 20.8 | 27.70 |
| Search-R1 | 61.0 | 37.0 | 41.4 | 14.6 | 36.8 | 38.16 |
| ZeroSearch | 65.2 | 34.6 | 35.2 | 18.4 | 27.8 | 36.24 |
| ParallelSearch | 62.8 | 42.9 | 42.4 | 19.7 | 41.1 | 41.78 |
| ICRL | 75.4 | 42.6 | 53.6 | 26.0 | 48.0 | 49.12 |

ICRL achieves +7.34 average EM over ParallelSearch (41.78). It wins on 4 of 5 datasets, with particularly strong performance on TriviaQA (75.4), 2Wiki (53.6), and Bamboogle (48.0).
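The averages and gaps can be checked directly from the per-dataset scores. The sketch below copies a subset of the Table 3 rows as they appear in this summary and recomputes the Average column and the gap to the best baseline; the helper names are illustrative.

```python
from typing import Dict, List, Tuple

# Per-dataset EM: TriviaQA, HotpotQA, 2Wiki, Musique, Bamboogle.
rows_3b: Dict[str, List[float]] = {
    "Search-R1": [54.5, 32.4, 31.9, 10.3, 26.4],
    "ZeroSearch": [57.4, 27.4, 30.0, 9.8, 11.1],
    "ICRL": [72.6, 35.4, 39.2, 20.0, 33.6],
}
rows_7b: Dict[str, List[float]] = {
    "Search-R1": [61.0, 37.0, 41.4, 14.6, 36.8],
    "ParallelSearch": [62.8, 42.9, 42.4, 19.7, 41.1],
    "ICRL": [75.4, 42.6, 53.6, 26.0, 48.0],
}


def average(scores: List[float]) -> float:
    """Mean EM across datasets, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)


def gap_over_best(rows: Dict[str, List[float]], method: str = "ICRL") -> Tuple[str, float]:
    """Return the best competing method and the average-EM gap to it."""
    others = [(name, average(s)) for name, s in rows.items() if name != method]
    best_name, best_avg = max(others, key=lambda item: item[1])
    return best_name, round(average(rows[method]) - best_avg, 2)


best_3b, gap_3b = gap_over_best(rows_3b)
best_7b, gap_7b = gap_over_best(rows_7b)
```

With these rows, the 7B gap over ParallelSearch comes out to 7.34, matching the reported figure, while the 3B gap over Search-R1 comes out to 9.06 rather than 8.94. The summary gives no explanation for that difference.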

Table 4: ICRL vs. O2-Searcher (SFT-free vs. cold-start SFT)

On Qwen2.5-3B:

| Method | SFT | TriviaQA | HotpotQA | 2Wiki | Musique | Bamboogle | Average |
|--------|-----|----------|----------|-------|---------|-----------|---------|
| O2-Searcher | ✓ | 59.7 | 38.8 | 37.4 | 16.0 | 34.4 | 37.26 |
| ICRL | ✗ | 72.6 | 35.4 | 39.2 | 20.0 | 33.6 | 40.16 |

ICRL outperforms O2-Searcher by +2.9 average EM despite using zero SFT data. The gain on TriviaQA (+12.9) is substantial.

Table 6: Scaling to Qwen2.5-14B

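| Method | TriviaQA | HotpotQA | 2Wiki | Musique | Bamboogle | Average |
|--------|----------|----------|-------|---------|-----------|---------|
| Direct | 52.0 | 22.6 | 28.2 | 6.0 | 15.2 | 24.80 |
| CoT | 56.4 | 24.6 | 25.8 | 9.0 | 40.0 | 31.16 |
| ICRL | 75.0 | 43.2 | 61.8 | 25.6 | 53.6 | 51.84 |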

ICRL on 14B achieves +20.7 average EM over CoT and +27.0 over direct prompting, demonstrating effective scaling.

Table 7: Math reasoning with code execution (AIME benchmarks)

On Qwen3-8B:

| Method | SFT | AIME2024 | AIME2025 |
|--------|-----|----------|----------|
| ReTool | ✓ | 67.0 | 49.3 |
| ICRL | ✗ | 64.1 | 51.7 |

ICRL slightly underperforms ReTool on AIME2024 (-2.9 points) but exceeds it on AIME2025 (+2.4 points), demonstrating competitive performance without SFT.

Ablation Studies¶

Curriculum design (Figure 2): Comparing 3→2→0 (three-stage) versus 3→2→1→0 (four-stage) curricula:

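| Curriculum | TriviaQA | HotpotQA | 2Wiki | Musique | Bamboogle |
|------------|----------|----------|-------|---------|-----------|
| 3→2→0 | 75.4 | 42.6 | 53.6 | 26.0 | 48.0 |
| 3→2→1→0 | 20.8 | 17.8 | 26.8 | 9.0 | 14.4 |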

The three-stage curriculum dramatically outperforms the four-stage one on every benchmark. The four-stage curriculum leads to faster stopping (80% of queries finish within 2 turns) but lower quality, which the paper attributes to rollout length being reduced too early, encouraging premature stopping and weakening multi-turn reasoning.

Training dynamics (Figure 3): Analysis across 3-shot, 2-shot, and 0-shot stages shows:

- Response length drops initially when transitioning to 0-shot (removal of examples) but gradually recovers as the model learns to generate longer, structured outputs independently.
- The number of valid tool calls increases during 0-shot training, indicating the model internalizes tool-use behavior even with sparse rewards.

Assessment: Do the Experiments Support the Claims?¶

Claim 1: ICRL achieves SOTA performance on tool-use benchmarks. Supported at 3B and 7B, where ICRL achieves the highest average EM among all reported baselines, including recent RL-based methods like Search-R1, ZeroSearch, and ParallelSearch. At 14B the comparison covers only Direct and CoT prompting, so the claim is less firmly established at that scale.

Claim 2: ICRL eliminates the need for SFT. Supported with strong evidence. Table 4 directly compares ICRL (no SFT) against O2-Searcher (cold-start SFT) on the same model size, showing ICRL achieves higher average EM. Similarly, Table 7 shows ICRL matching ReTool on math benchmarks despite ReTool requiring extensive SFT.

Claim 3: The curriculum enables transition from imitation to autonomy. Supported by ablation results and training dynamics. The dramatic gap between 3→2→0 and 3→2→1→0 curricula shows curriculum design matters, and the increasing valid tool calls during 0-shot training demonstrate internalization.

Potential weaknesses:

Single demonstration source: All experiments use only 3 GPT-5.2-generated demonstrations. The paper doesn't analyze whether demonstration quality matters or how few is "too few."
Limited baseline comparison on larger models: Table 6 only compares ICRL against Direct and CoT for 14B, not against Search-R1 or ParallelSearch, making it unclear if the advantage holds against stronger baselines at scale.
No inference cost analysis: While ICRL eliminates SFT training cost, the paper doesn't analyze inference-time costs. Zero-shot ICRL models may have different latency profiles than few-shot prompting methods.
Sensitivity to hyperparameters: The paper doesn't report sweeps over key hyperparameters like \(\alpha\) (accuracy/format balance), KL penalty \(\beta\), or the number of rollout trajectories per query.

Concrete Example Analysis¶

Table 5 provides a complete example on Bamboogle. The question asks: "When did the president who set the precedent of a two term limit enter office?"

The model correctly:

1. Generates a search query identifying George Washington as the relevant president.
2. Parses the retrieved documents to confirm Washington set the precedent.
3. Issues a follow-up search query for Washington's inauguration date.
4. Extracts and returns the correct answer: "April 30, 1789."

This example demonstrates multi-turn reasoning with tool integration—the model decomposes a compositional question, searches sequentially, and synthesizes information across multiple retrievals.

6. Limitations and Trade-offs¶

Assumption: High-Quality Demonstrations Are Easy to Obtain¶

ICRL relies on the assumption that a small number of demonstration examples (the paper uses just 3) can effectively bootstrap tool-use learning. These demonstrations are generated by GPT-5.2 following the rollout template format. The paper provides no analysis of demonstration quality sensitivity—what happens if the demonstrations contain subtle errors, suboptimal tool calls, or inconsistent formatting?

This assumption is particularly fragile for complex multi-step reasoning tasks. A demonstration that shows a correct final answer but uses an inefficient search strategy (e.g., issuing redundant queries) could teach the model bad habits during the early imitation phase. The paper acknowledges that demonstrations provide "soft supervision" but doesn't analyze whether poor demonstrations would degrade or merely slow learning. The curriculum reduction mechanism would eventually force the model to self-correct via RL signals, but early-stage learning could be significantly impacted.

Additionally, the demonstration generation process uses GPT-5.2, a model more capable than the Qwen2.5-3B being trained. This raises a subtle distribution shift question: the demonstrations may illustrate reasoning patterns that smaller models struggle to replicate, even with in-context guidance.

The Curriculum Design Is Not Fully Understood¶

The ablation on curriculum design reveals a surprising and unexplained result: the three-stage curriculum (3→2→0) dramatically outperforms the four-stage curriculum (3→2→1→0), with TriviaQA accuracy dropping from 75.4 to 20.8. The paper hypothesizes that "aggressively reducing rollout length too early encourages premature stopping and weakens multi-turn reasoning," but this explanation is post-hoc and incomplete.

Why would adding an intermediate 1-shot stage hurt performance? Intuitively, a more gradual reduction should provide smoother curriculum learning. The result suggests something more complex is happening—perhaps the 1-shot stage creates a "crutch" where the model learns to heavily rely on minimal scaffolding, making the final transition to 0-shot more disruptive. Or perhaps the 1-shot demonstrations introduce conflicting signals (the single example may not be representative of the query distribution).

This is a significant limitation because the paper doesn't provide a principled method for choosing curriculum schedules. Practitioners applying ICRL to new domains would need to run expensive ablations to determine the optimal curriculum, undermining the framework's efficiency claims.

Scalability to Larger Models Is Partially Demonstrated¶

Table 6 shows ICRL on Qwen2.5-14B achieving strong results (51.84 average EM), but only compares against Direct and CoT baselines—not against the stronger baselines like Search-R1, ZeroSearch, or ParallelSearch that were evaluated on 3B and 7B models. This makes it unclear whether ICRL's advantage persists at larger scales or whether SFT+RL methods might catch up.

The paper states ICRL "continues to scale effectively to larger model sizes" (Section 4.1), but without comparing against the same baselines, this claim is only weakly supported. It's possible that larger models benefit more from SFT+RL approaches, or that the relative advantage of ICRL diminishes as base model capability increases.

Single Tool Type Per Domain¶

The experiments train separate models for web search (QA benchmarks) and code execution (math benchmarks). The paper doesn't demonstrate multi-tool training—a single model that learns to choose between or combine different tool types (e.g., using both search and code execution within the same reasoning trace).

This is a practical limitation because real-world tool-augmented systems often need multiple tools. The rollout template (Table 1) only describes search tool usage; extending to multiple tools would require more complex demonstration formats and could introduce exploration challenges (the model must learn which tool to call, not just how to call tools).

No Analysis of Inference-Time Costs¶

ICRL eliminates SFT training costs, but the paper doesn't analyze inference-time efficiency. A potential concern: zero-shot ICRL models may require more tokens to achieve correct tool use compared to few-shot prompting, since they must generate reasoning from scratch rather than following demonstrated patterns.

The training dynamics analysis (Figure 3a) shows response length dropping then recovering during 0-shot training, but doesn't compare against inference-time few-shot baselines. If ICRL models require 50% more reasoning tokens to achieve the same accuracy, the training cost savings might be offset by increased deployment costs. This is particularly relevant for production systems with high query volumes.

Sparse Reward Signal for Complex Behaviors¶

The reward function uses only two components: accuracy (binary exact match) and format correctness (violation penalties). This is an extremely sparse signal for learning complex multi-turn tool-use behaviors. The model must discover effective search strategies, query formulation, and information synthesis through pure trial-and-error guided only by final-answer correctness.

While the format reward provides some dense signal for structural learning, it doesn't reward intermediate behaviors like issuing good search queries or reasoning effectively. The paper shows this works empirically (the "number of valid search" metric increases during training), but it's unclear whether more informative rewards (e.g., retrieval relevance scores, reasoning step rewards) would further improve sample efficiency.

The heavy weight on accuracy (\(\alpha = 0.8\)) means format violations are penalized relatively lightly. A model could therefore learn brittle formatting habits that cost little reward during training but fail on edge cases.
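To make the effect of the weighting concrete, here is a minimal sketch of a two-component reward. The exact functional form is not given in this section, so the convex combination, the function name `combined_reward`, and a format score normalized to [0, 1] are all assumptions made for illustration.

```python
def combined_reward(exact_match: bool, format_score: float, alpha: float = 0.8) -> float:
    """Blend accuracy and format terms.

    ASSUMED form (not confirmed by the paper):
        reward = alpha * accuracy + (1 - alpha) * format_score
    where accuracy is binary exact match and format_score lies in [0, 1],
    with 1.0 meaning no format violations.
    """
    if not 0.0 <= format_score <= 1.0:
        raise ValueError("format_score must lie in [0, 1]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    accuracy = 1.0 if exact_match else 0.0
    return alpha * accuracy + (1.0 - alpha) * format_score


# Three illustrative cases under the assumed form.
correct_clean = combined_reward(True, 1.0)    # right answer, clean format
correct_broken = combined_reward(True, 0.0)   # right answer, every format check failed
wrong_clean = combined_reward(False, 1.0)     # wrong answer, clean format
```

Under this assumed form, a correct answer with a completely broken format still earns four times the reward of a well-formatted wrong answer, which is the sense in which format violations are lightly penalized.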

Limited Hyperparameter Sensitivity Analysis¶

The paper fixes several critical hyperparameters without reported ablations:

- Number of demonstrations (N=3): Would N=2 or N=5 work better?
- Accuracy/format balance (\(\alpha = 0.8\)): How does this trade-off affect learning?
- KL penalty coefficient (\(\beta = 0.001\)): Is this optimal or was it inherited from prior work?
- Number of rollout trajectories per query (N=8): Does this affect GRPO's advantage estimation quality?

Without sensitivity analysis, practitioners cannot know which parameters are robust versus which require careful tuning for new domains.

Evaluation Set Size and Statistical Reliability¶

Each benchmark is sampled to at most 500 questions, and average EM is computed across five benchmarks. For multi-hop datasets like Musique where ICRL achieves 20.0 EM on Qwen2.5-3B, this represents roughly 100 correct answers out of 500 questions. The paper doesn't report confidence intervals or statistical significance tests.

For smaller differences (e.g., ICRL vs. O2-Searcher on HotpotQA: 35.4 vs. 38.8), the gap could potentially be within noise margins given the sample size. The dramatic curriculum ablation results (75.4 vs. 20.8) are clearly significant, but smaller performance differences should be interpreted cautiously.
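The noise argument can be checked with a back-of-the-envelope calculation. The sketch below uses the normal (Wald) approximation for a binomial proportion and treats the two systems as independent samples of 500 questions each. Both are simplifying assumptions: the systems are scored on the same questions, so a paired test would be more appropriate, but that needs per-question outputs the paper does not report. The function names are illustrative.

```python
import math


def em_standard_error(em_percent: float, n: int) -> float:
    """Standard error, in EM points, of an exact-match score measured on n questions."""
    p = em_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def difference_z_score(em_a: float, em_b: float, n_a: int, n_b: int) -> float:
    """z-score for the gap between two EM scores, ASSUMING independent samples."""
    se_diff = math.hypot(em_standard_error(em_a, n_a), em_standard_error(em_b, n_b))
    return (em_a - em_b) / se_diff


N = 500  # per-benchmark cap stated in the text; assumed to apply to every score below

musique_se = em_standard_error(20.0, N)               # ICRL on Musique, Qwen2.5-3B
hotpot_z = difference_z_score(38.8, 35.4, N, N)       # O2-Searcher vs. ICRL on HotpotQA
ablation_z = difference_z_score(75.4, 20.8, N, N)     # curriculum ablation

print(f"Musique SE: {musique_se:.2f} EM points")
print(f"HotpotQA gap z-score: {hotpot_z:.2f}")
print(f"Curriculum ablation z-score: {ablation_z:.2f}")
```

With these assumptions, a 20.0 EM score on 500 questions carries a standard error of about 1.8 points (a 95% interval of roughly ±3.5 points), and the 3.4-point HotpotQA gap gives a z-score of about 1.1, below the usual 1.96 threshold. The curriculum ablation gap sits far outside any plausible noise band.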

7. Implications and Future Directions¶

How This Work Changes the Landscape¶

ICRL challenges a core assumption in the tool-use learning literature: that supervised fine-tuning is a necessary prerequisite for reinforcement learning. Prior work like Search-R1, ZeroSearch, and O2-Searcher all followed an SFT+RL paradigm, accepting the data collection cost as unavoidable. By demonstrating that in-context demonstrations can substitute for SFT, ICRL opens a new axis in the design space: instead of warming up on labeled data through SFT, bootstrap with minimal prompting and let RL do the heavy lifting.

This has practical implications for how tool-augmented systems are developed. Organizations building domain-specific tool-using models can now consider a third option beyond (1) expensive SFT data collection or (2) few-shot prompting with its inference costs: ICRL's lightweight training that requires only a handful of demonstration examples. The +2.9 average EM improvement over O2-Searcher (which uses cold-start SFT) while using zero labeled data is a compelling proof point.

The work also establishes curriculum learning as a first-class concern for RL-based tool-use training. The dramatic ablation result—that curriculum schedule can cause 50+ EM point swings—demonstrates that how demonstrations are phased out matters as much as using them in the first place. This is an underexplored dimension that future work must address.

Follow-Up Research This Work Enables¶

1. Principled curriculum design. The unexpected result that 3→2→0 outperforms 3→2→1→0 demands investigation. What governs optimal curriculum schedules? Potential factors include demonstration quality, query difficulty distribution, and model capacity. Future work could develop adaptive curricula that reduce demonstrations based on learning progress metrics (e.g., reward plateau detection) rather than fixed schedules, or theoretically grounded schedules derived from RL exploration-exploitation trade-offs.

2. Demonstration quality analysis. The paper uses GPT-5.2-generated demonstrations but doesn't analyze their properties. Future work should systematically study: (a) minimum demonstration quantity—can 1-shot or 2-shot work for simpler tasks?; (b) demonstration diversity—should examples cover different reasoning patterns or reinforce a single strategy?; (c) error tolerance—how does learning degrade with noisy demonstrations?

3. Multi-tool integration. The current framework trains separate models for search and code execution. A natural extension is training a single model to use multiple tools, requiring it to learn tool selection in addition to tool invocation. This introduces new challenges: demonstrations must illustrate multiple tool types, and the reward function must handle different output formats.

4. Dense reward shaping. The binary accuracy reward is sparse. Incorporating intermediate rewards—such as retrieval relevance scores from the search engine, code execution success/failure signals, or reasoning coherence metrics—could accelerate learning. However, this risks reward hacking; future work must balance informativeness with robustness.

5. Transfer and generalization across domains. ICRL trains on Natural Questions and evaluates on five QA benchmarks plus AIME math problems. A more comprehensive generalization analysis would test: (a) cross-domain transfer (train on QA, test on code generation); (b) cross-tool transfer (train on search, test on API calling); (c) out-of-distribution queries (novel reasoning patterns not in training data).

6. Scaling analysis with stronger baselines. The 14B experiments lack comparison against Search-R1, ZeroSearch, and ParallelSearch. Determining whether ICRL's advantage persists at larger scales—and how the gap evolves—is essential for understanding the method's practical applicability.
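As one illustration of the adaptive curricula suggested in item 1, the sketch below advances to the next demonstration count when the mean rollout reward stops improving. It is a hypothetical design, not something the paper proposes or evaluates: the class name `PlateauCurriculum`, the two-window comparison, and the `window` and `min_gain` parameters are all assumptions.

```python
from collections import deque


class PlateauCurriculum:
    """Hypothetical scheduler: reduce the shot count when reward plateaus."""

    def __init__(self, schedule=(3, 2, 0), window=50, min_gain=0.01):
        self.schedule = tuple(schedule)  # shot counts per stage, e.g. 3 -> 2 -> 0
        self.stage = 0
        self.window = window             # steps in each comparison window
        self.min_gain = min_gain         # smallest improvement that counts as progress
        self.rewards = deque(maxlen=2 * window)

    @property
    def num_shots(self) -> int:
        return self.schedule[self.stage]

    def update(self, mean_reward: float) -> int:
        """Record one training step's mean reward; return the shot count to use next."""
        self.rewards.append(mean_reward)
        is_last_stage = self.stage == len(self.schedule) - 1
        if len(self.rewards) == 2 * self.window and not is_last_stage:
            history = list(self.rewards)
            older = sum(history[: self.window]) / self.window
            recent = sum(history[self.window:]) / self.window
            if recent - older < self.min_gain:
                self.stage += 1
                self.rewards.clear()  # start fresh statistics for the new stage
        return self.num_shots


# Usage: call once per training step with that step's mean rollout reward.
curriculum = PlateauCurriculum(schedule=(3, 2, 0), window=2, min_gain=0.01)
shots_per_step = [curriculum.update(r) for r in [0.1, 0.2, 0.3, 0.4, 0.4, 0.4, 0.4]]
```

A plateau detector of this kind cannot tell whether reward has flattened because the model has mastered the current stage or because training has stalled, so a practical version would likely need an additional minimum-reward condition before advancing.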

Practical Applications and Downstream Use Cases¶

Domain-specific tool-augmented assistants. Organizations deploying LLMs in specialized domains (legal research, medical diagnosis, financial analysis) often need tool integration (database queries, calculator calls, document retrieval). Collecting SFT data for each domain is prohibitively expensive. ICRL offers a scalable alternative: generate 3–5 domain-appropriate demonstrations, train with RL, and deploy without the labeled data bottleneck.

Rapid tool integration for new APIs. When new tools or APIs become available (e.g., a new search engine, a specialized computation service), ICRL enables fast integration without collecting labeled trajectories. A small set of demonstration queries showing correct API usage suffices to bootstrap learning.

Self-improving agent systems. ICRL's framework naturally extends to self-improvement loops: a model trained with ICRL generates successful trajectories, which become new demonstrations for training smaller/faster models, creating a distillation pipeline. The minimal demonstration requirement makes this tractable.

Edge deployment with efficient training. For on-device models where SFT infrastructure is limited, ICRL's lightweight training approach (no large labeled datasets, standard RL infrastructure) could enable customization of tool-use capabilities without cloud-scale resources.

When to Prefer This Method Over Alternatives¶

Prefer ICRL when:

- Labeled tool-use trajectories are unavailable or expensive to collect.
- The target domain has clear success criteria (exact match answers, code execution results) for reward computation.
- Prompt overhead at inference is the primary constraint: ICRL produces zero-shot models that need no in-context demonstrations.
- Multiple related tasks share similar tool-use patterns, so a model trained on one domain can transfer to the others.

Prefer SFT+RL when:

- High-quality labeled trajectories already exist (e.g., from expert annotations or existing systems).
- The task requires complex tool-use patterns that demonstrations can't easily convey; dense supervision may be necessary.
- Maximum performance is critical and data cost is secondary; SFT+RL may have a higher ceiling on some tasks.

Prefer few-shot prompting when:

- Training infrastructure is unavailable or deployment must happen immediately.
- The task is simple enough that demonstrations alone achieve acceptable performance.
- Query volume is low enough that per-query prompt costs don't matter.

For replication or integration, key practical choices are: (1) use instruct variants of models for faster RL convergence; (2) start with 3-shot demonstrations following the Table 1 template; (3) use the 3→2→0 curriculum as the default schedule; (4) set \(\alpha = 0.8\) to prioritize accuracy while maintaining format pressure; (5) apply loss masking to exclude tool-returned tokens from optimization; (6) monitor "valid search" counts during training as a sanity check for tool-use learning.
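Choice (5), loss masking, can be sketched as follows. This is a simplified illustration rather than the paper's implementation: it assumes tool output is wrapped in `<information>` and `</information>` delimiters and that each delimiter is a single token. Both the tag names and the helper `tool_loss_mask` are hypothetical. Real tokenizers usually split such tags into several pieces, so a practical version would derive the mask from character offsets instead.

```python
from typing import List


def tool_loss_mask(
    tokens: List[str],
    open_tag: str = "<information>",    # hypothetical delimiter for tool output
    close_tag: str = "</information>",  # hypothetical delimiter for tool output
) -> List[int]:
    """Return 1 for model-generated tokens and 0 for tool-returned tokens.

    The delimiters themselves are masked out too, on the assumption that the
    environment inserts them rather than the model.
    """
    mask = []
    inside_tool_output = False
    for token in tokens:
        if token == open_tag:
            inside_tool_output = True
        mask.append(0 if inside_tool_output else 1)
        if token == close_tag:
            inside_tool_output = False
    return mask


# Usage: multiply per-token losses by the mask before averaging.
trajectory = [
    "<search>", "capital", "of", "France", "</search>",
    "<information>", "Paris", "is", "the", "capital", "</information>",
    "<answer>", "Paris", "</answer>",
]
mask = tool_loss_mask(trajectory)
```

Only positions with a mask value of 1 would contribute to the policy gradient, so the model is not trained to reproduce text it did not generate.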