Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

🎯 Pitch

The paper introduces Intern-S1-MO, a multi-agent, multi-round reasoning system that replaces single-pass chain-of-thought with hierarchical lemma-based memory, enabling LRMs to explore vastly longer proof trajectories by storing and reusing verified intermediate lemmas. It also presents OREAL-H, an RL training framework that leverages process-verifier feedback and lemma dependency graphs for improved credit assignment across rounds—together delivering state-of-the-art results on Olympiad-level benchmarks (e.g., 26/35 on non-geometry IMO2025) and demonstrating a practical path beyond fixed context limits for complex mathematical reasoning.

1. Executive Summary

Intern-S1-MO is a long-horizon mathematical reasoning agent that targets Olympiad-level problems by replacing a single long chain-of-thought with multi-round, hierarchical reasoning plus a compact “lemma memory” that stores reusable intermediate results (Figure 2). To train this multi-round behavior, the paper introduces OREAL-H, a reinforcement learning framework that uses process-verifier feedback and a lemma dependency graph for better credit assignment across rounds (Eq. (5)–(8), Algorithm 1). On the paper’s evaluations, the system reaches 26/35 points on non-geometry IMO2025 (pass@4) and 102/126 on CMO2025 under human grading (Table 3), while also improving scores on AIME2025/HMMT2025/CNMO2025 (Table 1).

2. Context and Motivation

Problem / gap addressed
Large Reasoning Models (LRMs) improve math performance using CoT (Chain-of-Thought) and RL with verifiable rewards (RLVR), but their performance is described as heavily dependent on long reasoning context length.
For very hard problems (e.g., IMO), the reasoning complexity is framed as exceeding what an LRM can explore “in a single round,” even if the model has a large context window.
Why this matters
The paper argues that as problem difficulty increases, both human thinking time and model token consumption grow sharply, creating a mismatch with practical context limits (Figure 1(a)).
It positions the central bottleneck as: context length is limited (often 64k–128k tokens) while IMO-style reasoning can require substantially more exploration and revisiting of partial progress (Introduction; Figure 1).
Prior approaches and where they fall short (as positioned in the paper)
Multi-round interaction / parallel decoding / reflection via prompting: these broaden search or add self-correction, but are described as still operating within a single reasoning cycle that does not cumulatively build on earlier explorations across rounds (Introduction; Related Work §2.1).
Formal-language proof/search systems: they can store structured intermediate results, but are described as requiring extensive state traversal and proof checking (high overhead) and incurring translation costs from informal to formal logic (Introduction).
Proprietary LRMs: reported strong IMO2025 performance, but the community lacks access to systematic methods and models (Introduction).
Paper’s positioning
The paper presents a systematic agent architecture + training pipeline designed specifically for multi-round, lemma-driven exploration, aiming to “break through context constraints” on IMO-level problems (Introduction; Figure 2).
It also positions OREAL-H as addressing RL limitations for agents where rewards are delayed and process-verifier feedback is noisy (Section 4).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a math-problem-solving agent built around a base LRM, expanded into a multi-agent workflow that repeatedly reasons, summarizes intermediate results into lemmas, and verifies them.
It solves long and difficult proofs by turning one long reasoning trace into multiple shorter rounds connected by a compact lemma memory, plus a training method (OREAL-H) that learns from the agent’s own explored trajectories.

3.2 Big-picture architecture (diagram in words)

Input: a math problem (and, from round 2 onward, a lemma library / memory).
Reasoner agent: generates a solution attempt or partial progress for the current round.
Summarizer agent: compresses the round’s reasoning trace into a set of candidate lemmas.
Verifier agent(s):
A theorem verifier assigns confidence to intermediate lemmas via multiple verification samples.
A process verifier (from OPV) evaluates final-solution rigor and provides feedback for revision.
Memory system: stores verified lemmas; feeds them back into the next reasoning round (Figure 2).
Final refinement loop: iteratively revises the final solution using process-verifier feedback until it passes or a max loop count is reached (Figure 2).

3.3 Roadmap for the deep dive

I will explain:
The multi-round agent loop and why lemma memory matters for long-horizon reasoning (Section 3; Figure 2).
The two verification layers (lemma-level and proof-process-level) and how they mitigate error propagation (Section 3).
The Hierarchical MDP formulation used for training (Section 4.1; Eq. (1)–(3)).
OREAL-H’s two key training ideas: lemma dependency graphs for credit assignment and conjugate reward modeling for noisy process verification (Section 4.3; Eq. (5)–(8), Algorithm 1).
The inference/training budgets and hyperparameters the paper reports (Appendix B; Section 5).

3.4 Detailed, sentence-based technical breakdown

This is an empirical agent/system paper with an algorithmic RL component, and the core idea is to extend effective reasoning depth by splitting problem solving into multiple rounds connected by a verified lemma memory, then training the policy with RL that uses process-level feedback plus a graph-based progress signal.

3.4.1 Multi-round hierarchical reasoning with lemma memory (Figure 2; Section 3)

The agent runs in discrete rounds rather than one long generation, because long IMO-style solutions require exploration, backtracking, and reuse of partial progress that may not fit in one context window (Introduction; Figure 1(a)).
In each round, the reasoner produces either:
a complete solution attempt, or
a partial but rigorous set of intermediate results (lemmas), explicitly encouraged to avoid “premature conclusion bias” where the model rushes to a final answer when budget runs out (Section 3, “Decomposing Sub-Problems for Lemma Search”; Appendix A.1 prompt stresses partial solutions and rigor).
After the reasoner finishes a round, the summarizer compresses the long trace into a structured set of sub-lemmas (Section 3, “Summarizing Exploration for Memory Maintenance”).
The motivation is that raw traces contain redundant trial-and-error; subsequent rounds mostly benefit from the validated intermediate claims, not the full verbose exploration.
These lemmas are stored in a lemma library (memory system) and then fed back into the next reasoning round alongside the original question (Figure 2).
This creates a cumulative “scratchpad” effect: instead of re-deriving the same intermediate results, later rounds can start from previously established lemmas.

System/data pipeline diagram in words (explicit flow):
1. Round 1 starts with the Question alone as input.
2. The Reasoner agent generates a long-chain attempt (possibly partial).
3. The Summarizer agent reads the reasoning trace and outputs a list of candidate lemmas to store.
4. The Lemma verifier checks each lemma (via parallel sampling) and assigns a confidence score; only sufficiently trusted lemmas are added to memory (Section 3, “Verifying Theorems…”).
5. Round 2..(n−1) repeat steps 2–4, but now the input to the reasoner is Question + Lemma Library (Figure 2).
6. In the final round, the reasoner outputs a full solution draft.
7. The Process verifier (OPV) evaluates the final proof and produces natural-language feedback identifying flawed steps; the agent enters a revision loop to fix issues until verification passes or the loop limit is reached (Figure 2; Section 3, “Verifying Process for Final Proof Completion”).
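The seven-step flow can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_agent`, the callable signatures, and `accept_threshold` are hypothetical names, and the four agents are passed in as plain functions so the control flow can be read in isolation. The default limits of 8 rounds and 8 revisions mirror the evaluation budgets reported in Appendix B.1.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures (assumptions, not from the paper):
#   reasoner(question, lemma_library, feedback) -> reasoning trace or solution draft
#   summarizer(trace) -> candidate lemmas
#   lemma_confidence(lemma) -> confidence in [0, 1]
#   process_verifier(question, draft) -> None if the proof passes, else feedback text
Reasoner = Callable[[str, List[str], Optional[str]], str]


def run_agent(
    question: str,
    reasoner: Reasoner,
    summarizer: Callable[[str], List[str]],
    lemma_confidence: Callable[[str], float],
    process_verifier: Callable[[str, str], Optional[str]],
    max_rounds: int = 8,
    max_revisions: int = 8,
    accept_threshold: float = 0.5,  # assumed; the excerpt gives no threshold
) -> Tuple[str, List[str], bool]:
    lemma_library: List[str] = []

    # Steps 1-5: exploration rounds that grow the lemma library.
    for _ in range(max_rounds - 1):
        trace = reasoner(question, lemma_library, None)
        for lemma in summarizer(trace):
            is_new = lemma not in lemma_library  # naive exact-match dedup
            if is_new and lemma_confidence(lemma) >= accept_threshold:
                lemma_library.append(lemma)

    # Step 6: the final round produces a full solution draft.
    draft = reasoner(question, lemma_library, None)

    # Step 7: revise with process-verifier feedback until it passes
    # or the revision budget is exhausted.
    for _ in range(max_revisions):
        feedback = process_verifier(question, draft)
        if feedback is None:
            return draft, lemma_library, True
        draft = reasoner(question, lemma_library, feedback)

    passed = process_verifier(question, draft) is None
    return draft, lemma_library, passed
```

The only state carried between rounds is the lemma library, which is the point of the design: each reasoner call sees the question plus a short list of trusted statements rather than the previous rounds' full traces.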

3.4.2 Verification design: lemma-level and process-level (Section 3)

Lemma verification to reduce error propagation
The paper highlights a failure mode where an incorrect intermediate conclusion can mislead future reasoning (“circular reasoning or invalid proofs”), so it adds verification at the lemma level because checking a lemma is framed as easier than checking the entire solution (Section 3).
Mechanism: for each lemma, the theorem verifier runs n parallel verifications and uses the proportion of “correctly identified” outcomes as a confidence score (Section 3).
The paper’s aim is to reduce false positives/negatives by voting-style aggregation, though the excerpt does not specify the exact verifier model architecture or thresholding policy used to accept/reject a lemma.
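The voting-style confidence score can be illustrated in a few lines. This is a sketch under assumptions: `verify_once` stands in for a single call to the theorem verifier, the calls run sequentially here rather than in parallel, and the `threshold` used for acceptance is hypothetical because the excerpt does not state one. The default `n = 4` follows the evaluation budget in Appendix B.1.

```python
from typing import Callable, List


def lemma_confidence(lemma: str, verify_once: Callable[[str], bool], n: int = 4) -> float:
    """Fraction of n independent verifier runs that judge the lemma correct."""
    if n <= 0:
        raise ValueError("n must be positive")
    votes = [verify_once(lemma) for _ in range(n)]
    return sum(votes) / n


def filter_lemmas(
    candidates: List[str],
    verify_once: Callable[[str], bool],
    n: int = 4,
    threshold: float = 0.75,  # assumed acceptance threshold, not from the paper
) -> List[str]:
    """Keep only candidates whose confidence reaches the threshold."""
    return [l for l in candidates if lemma_confidence(l, verify_once, n) >= threshold]
```

With `n = 4` the score can only take the values 0, 0.25, 0.5, 0.75, and 1, so the choice of threshold amounts to choosing how many dissenting verifier runs are tolerated.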
Process verification for final proof completion
For end-to-end correctness and training feedback, the paper uses the OPV process verifier, which provides step-level feedback in natural language (Section 3; Section 4.1 reward discussion).
The paper cites OPV as having F1 > 85% on ProcessBench (Section 3), and uses it in two roles:
1. Test-time scaling: aggregate verification across multiple runs for robustness.
2. Training signal / revision guidance: provide feedback for iterative revision and RL.

3.4.3 Training formulation: Hierarchical MDP + token-level policy gradient (Section 4.1; Eq. (1)–(3))

The paper models the overall agent as a Hierarchical Markov Decision Process (MDP):
State s_t includes problem context, reasoning trace, and verification feedback (Section 4.1).
High-level “meta-actions” u_t include things like “extract lemmas,” “invoke verification,” and “commit answer” (Section 4.1).
Low-level actions are the generated token sequences v_t = (v_{t,1}, …, v_{t,T_t}) from the language model policy π^L_θ(·|s_t) (Section 4.1).
The optimization objective is to maximize expected final reward R (Eq. (1)):
J(θ, φ) = E_{π^H_φ, π^L_θ}[R].
Here, π^H_φ denotes the high-level policy (not fully detailed in the excerpt), and π^L_θ is the low-level token policy.
The advantage is computed using a value function V(s_t) updated by temporal difference (Eq. (2)):
V(s_t) ← E[r_t + γ V(s_{t+1})].
The token-level gradient aggregates log-likelihood gradients within each round weighted by the round advantage (Eq. (3)):
∇_θ J = E[ Σ_t A_t · Σ_τ ∇_θ log π^L_θ(v_{t,τ} | s_t, v_{t,<τ}) ].
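To make the shape of Eq. (2) and Eq. (3) concrete, the sketch below computes per-round advantages and the scalar surrogate whose gradient matches the Eq. (3) estimator for one trajectory. It assumes the advantage is the one-step TD residual A_t = r_t + γ V(s_{t+1}) − V(s_t), which is the usual reading of Eq. (2) but is not spelled out in the excerpt; the function names are illustrative, and the token log-probabilities are supplied as plain numbers rather than produced by a model.

```python
from typing import List, Sequence


def td_advantages(rewards: Sequence[float], values: Sequence[float], gamma: float = 1.0) -> List[float]:
    """One-step TD residual per round.

    `values` holds V(s_0), ..., V(s_T), so it has one more entry than `rewards`.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def surrogate_loss(advantages: Sequence[float], token_logps: Sequence[Sequence[float]]) -> float:
    """Negative of sum_t A_t * sum_tau log pi(v_{t,tau} | s_t, v_{t,<tau}).

    Treating the advantages as constants, minimizing this value with an
    autodiff framework follows the gradient direction in Eq. (3).
    """
    return -sum(a * sum(logps) for a, logps in zip(advantages, token_logps))


# Two rounds: the first earns no reward, the second ends with reward 1.
adv = td_advantages(rewards=[0.0, 1.0], values=[0.25, 0.5, 0.0])
loss = surrogate_loss(adv, token_logps=[[-1.0, -0.5], [-2.0]])
```

Every token in a round shares that round's advantage, so credit is assigned at round granularity and the token-level sum only determines how strongly each round's text is reinforced.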

Important constraint from the excerpt: the paper describes both a high-level and low-level policy, but the provided text does not give implementational details for π^H_φ (parameterization, action encoding, or how it interacts with the multi-agent modules), so any deeper description would be speculative.

3.4.4 Cold start: behavioral cloning on successful structured trajectories (Section 4.2; Eq. (4))

Before online RL, the policy is initialized via behavioral cloning on filtered transitions where the output yields a well-formed lemma summary (syntactically valid, non-empty, logically segmented) (Section 4.2).
The pretraining loss is standard token negative log-likelihood over these filtered trajectories (Eq. (4)):
L_RFT(θ) = - E_{(s_t, v_t) ~ D_init} [ Σ_τ log π^L_θ(v_{t,τ} | s_t, v_{t,<τ}) ].
The dataset D_init is continuously augmented with outcome-filtered QA pairs “without previous thinking,” and the paper claims this improves discovery of positive trajectories during RL (Section 4.2).

3.4.5 `OREAL-H`: RL adaptations for multi-round agents (Section 4.3; Eq. (5)–(8), Algorithm 1)

The paper starts from OREAL (Outcome Reward RL) and introduces two key modifications to handle (i) delayed credit assignment across rounds and (ii) noisy continuous process-verifier rewards.

(A) Progress-conditioned advantage via lemma dependency graphs (Section 4.3.1; Eq. (5)–(6))
- Core idea: build a lemma dependency graph by aggregating lemmas produced across multiple rollouts for the same problem, then propagate success value backward from lemmas that lead to correct solutions.
- The paper defines a lemma value recursively (Eq. (5)):
  - Plain language paraphrase: a lemma is valuable if it tends to lead to successor lemmas that eventually lead to a successful final solution.
  - Notation: v(l) = E_{l' ∈ Succ(l)}[ v(l') ], where Succ(l) are lemmas derived from l.
- For a round t producing candidate lemma set L_t, the state value is taken optimistically as:
  - V(s_t) = max_{l ∈ L_t} v(l) (Section 4.3.1).
- The advantage becomes a temporal-difference style improvement in best reachable lemma value (Eq. (6)):
  - A_t = r_t + γ max_{l' ∈ L_{t+1}} v(l') − max_{l ∈ L_t} v(l).
- Rounds that produce no new lemmas are masked (advantage set to 0), so training focuses updates on rounds that yield “verifiable advances” (Section 4.3.1; Algorithm 1 lines 16–23).
- The paper claims this functions like a cheaper surrogate to MCTS-style value estimation, without extensive search overhead (Section 4.3.1; Figure 3 shows an example lemma graph).
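A toy version of the value backup in Eq. (5) and the advantage in Eq. (6) is sketched below. Several choices are assumptions made only for illustration: the graph is a plain dict and is assumed acyclic, leaf nodes take a supplied outcome value (the excerpt does not state the base case of the recursion), a round is masked when its lemma set is empty, and the value after the last round is taken as 0. The names `lemma_values`, `state_value`, and `progress_advantages` are hypothetical.

```python
from typing import Dict, List, Sequence


def lemma_values(succ: Dict[str, List[str]], leaf_value: Dict[str, float]) -> Dict[str, float]:
    """v(l) = mean of v(l') over successors; leaves use leaf_value (default 0)."""
    cache: Dict[str, float] = {}

    def v(lemma: str) -> float:
        if lemma not in cache:
            children = succ.get(lemma, [])
            if children:
                cache[lemma] = sum(v(c) for c in children) / len(children)
            else:
                cache[lemma] = leaf_value.get(lemma, 0.0)
        return cache[lemma]

    nodes = set(succ) | {c for cs in succ.values() for c in cs} | set(leaf_value)
    return {node: v(node) for node in nodes}


def state_value(lemmas: Sequence[str], values: Dict[str, float]) -> float:
    """Optimistic state value: the best lemma value produced in the round."""
    return max((values.get(l, 0.0) for l in lemmas), default=0.0)


def progress_advantages(
    rounds: Sequence[Sequence[str]],
    rewards: Sequence[float],
    values: Dict[str, float],
    gamma: float = 1.0,
) -> List[float]:
    """A_t = r_t + gamma * max v(L_{t+1}) - max v(L_t); empty rounds get 0."""
    advantages = []
    for t, lemmas in enumerate(rounds):
        if not lemmas:  # masked: no verifiable advance in this round
            advantages.append(0.0)
            continue
        next_value = state_value(rounds[t + 1], values) if t + 1 < len(rounds) else 0.0
        advantages.append(rewards[t] + gamma * next_value - state_value(lemmas, values))
    return advantages


# Lemma A leads to B (which reached a correct solution) and C (which did not).
succ = {"A": ["B", "C"], "B": ["solved"], "C": ["failed"]}
values = lemma_values(succ, leaf_value={"solved": 1.0, "failed": 0.0})
adv = progress_advantages(rounds=[["A"], ["B"]], rewards=[0.0, 1.0], values=values)
```

In the example, A is worth 0.5 because only half of its continuations succeed, so the round that moves from A to B earns a positive advantage of 0.5, while the final round earns 0 because the success it delivers was already expected from B.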

(B) Conjugate reward modeling for noisy process verification (Section 4.3.2; Eq. (7)–(8))
- The process verifier (PV) is stochastic: passing k out of n checks is not treated as a clean probability of correctness (Section 4.3.2).
- The paper models latent solution quality p ∈ [0,1] with a Beta prior Beta(1,1) and updates after observing k successes out of n PV trials:
  - Posterior: p | (k,n) ~ Beta(k+1, n-k+1) (Eq. (7)).
- Instead of using k/n directly, it defines reward as the probability the solution’s latent quality exceeds a “completely invalid” baseline that fails all checks:
  - p1 ~ Beta(k+1, n-k+1) (current solution), p0 ~ Beta(1, n+1) (baseline),
  - reward R(k,n) = P(p1 > p0) (Eq. (8)).
- In practice, the paper fixes n = 4 PV trials (Section 4.3.2).
- It reports R(4,4) ≈ 5.5 and baseline R(0,4) = 0, describing this as “99.5% dominance probability over the baseline,” with smoothly interpolated rewards for k = 1,2,3 (Section 4.3.2).
- Note: the excerpt’s numeric mapping (e.g., reward ≈ 5.5) implies a scaled reward rather than a pure probability in [0,1]; the text mixes “probability” language with a >1 numeric reward. The provided excerpt does not fully clarify this scaling, so the safest interpretation is: they compute a dominance probability and then use a scaled form as the RL reward.
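The dominance probability in Eq. (8) can be evaluated exactly, which helps separate the probability itself from any later scaling. The sketch below takes the definitions above literally: because the baseline is Beta(1, n+1), its CDF is 1 − (1 − x)^(n+1), so P(p1 > p0) = 1 − E[(1 − p1)^(n+1)] = 1 − B(k+1, 2n−k+2) / B(k+1, n−k+1). The function names are illustrative, and this is a reading of the paraphrased equations rather than the paper's code.

```python
from fractions import Fraction
from math import factorial


def beta_fn(a: int, b: int) -> Fraction:
    """Beta function B(a, b) for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def dominance_probability(k: int, n: int) -> Fraction:
    """P(p1 > p0) with p1 ~ Beta(k+1, n-k+1) and p0 ~ Beta(1, n+1)."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    a, b = k + 1, n - k + 1
    m = n + 1  # exponent from the baseline CDF 1 - (1 - x)^(n+1)
    return 1 - beta_fn(a, b + m) / beta_fn(a, b)


# Dominance probabilities for n = 4 process-verifier trials.
table = {k: dominance_probability(k, 4) for k in range(5)}
# table == {0: 1/2, 1: 7/9, 2: 11/12, 3: 41/42, 4: 251/252}
```

Under this literal reading, k = 4 gives 251/252 ≈ 0.996, close to the quoted “99.5% dominance probability,” while k = 0 gives exactly 1/2 because the solution and the baseline then share the same distribution. Neither matches the quoted reward values of about 5.5 and 0, which is consistent with the note above that some shift or scaling is applied before the value is used as the RL reward.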

(C) End-to-end OREAL-H loop (Algorithm 1)
1. Sample K multi-round trajectories per problem with the current policy.
2. Run PV to get k/n outcomes and compute conjugate reward.
3. Build lemma dependency graph across trajectories.
4. Backpropagate lemma values and compute progress-conditioned advantages.
5. Optimize the policy with an OREAL-style loss (Algorithm 1 line 27; details of the exact loss are referenced to [18] but not shown in the excerpt).

3.4.6 Budgets, inference scaling, and hyperparameters (Figure 1; Appendix B; Section 5.4)

Inference / test-time scaling budgets
- Figure 1(b) and the Introduction describe that Intern-S1-MO can use about 512k tokens to solve a single problem, motivated as ~8× extension over a 64k context constraint (Introduction; Conclusion).
- Default evaluation budgets (Appendix B.1):
  - Max 8 inference rounds for reasoner and summarizer.
  - Theorem verifier: 4 parallel verifications per lemma.
  - Final iterative revision (process verifier loop): max 8 rounds.
  - Max length of output for reasoner and summarizer: 64k tokens per output.

Training hyperparameters (Appendix B.2)
- Batch: 64 questions per batch.
- Rollouts: 16 rollouts per question.
- Max length per rollout trajectory: 65,536 tokens.
- Filtering: discard questions with overall pass rate 0 or 1.
- Optimizer: AdamW.
- Learning rate: 5e−7, cosine annealing schedule decaying to 1/5 of initial LR.
- KL coefficient β: 0.01.
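The reported settings can be collected into a small configuration, together with the two pieces that involve a rule rather than a constant: the learning-rate schedule and the question filter. This is a sketch under assumptions: the dictionary keys, `cosine_lr`, and `keep_question` are illustrative names, the standard cosine-annealing formula is assumed for the schedule, and `total_steps` is a free parameter because the excerpt does not report the number of training steps.

```python
import math

# Values as reported in Appendix B.2; key names are illustrative.
TRAIN_CONFIG = {
    "questions_per_batch": 64,
    "rollouts_per_question": 16,
    "max_rollout_tokens": 65_536,
    "optimizer": "AdamW",
    "base_lr": 5e-7,
    "final_lr_ratio": 0.2,  # decays to 1/5 of the initial LR
    "kl_coefficient": 0.01,
}


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-7, final_ratio: float = 0.2) -> float:
    """Cosine annealing from base_lr down to base_lr * final_ratio."""
    min_lr = base_lr * final_ratio
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def keep_question(num_correct: int, num_rollouts: int = 16) -> bool:
    """Discard questions whose pass rate is exactly 0 or exactly 1."""
    return 0 < num_correct < num_rollouts
```

The filter removes questions on which all rollouts agree, since a group with identical outcomes offers no contrast between better and worse trajectories.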

Official CMO2025 participation budget (Section 5.4)
- For each problem:
  - 256-shot parallel search over up to 12 rounds.
  - Intermediate lemmas: lemma verifier provides multiple rounds of 8-shot feedback.
  - Candidate solution refinement: 8-shot refinement procedure comprising 24 rounds, where OPV flags issues and the policy revises accordingly.

Model architecture details (missing in provided excerpt)
- The excerpt does not specify core LRM architecture/configuration such as:
  - number of layers, hidden size, attention heads, tokenizer, context window during pretraining, hardware, total training tokens, or compute budget.
- Supplying these details here would require guessing beyond the provided text.

4. Key Insights and Innovations

(1) Lemma-based compact memory enabling multi-round reasoning (Figure 2; Section 3)
Novelty relative to single-pass CoT: instead of relying on one extended context, the system repeatedly compresses reasoning into reusable lemmas and re-injects them as memory in later rounds.
Significance: enables “unlimited exploration capability” in principle by keeping only essential validated statements, letting the agent explore lemma-rich solution spaces beyond a single context limit.
(2) Explicit separation of roles into reasoner / summarizer / verifier agents (Figure 2)
Unlike prompt-only self-reflection, the paper operationalizes different cognitive tasks (explore vs compress vs check) as separate agent steps.
This is positioned as a systematic structure rather than ad-hoc prompting, and ablations suggest each part contributes (Table 2).
(3) Lemma verification via parallel sampling confidence scores (Section 3)
The paper treats lemma verification as easier than full-proof verification, using multiple verifier runs and a confidence score to reduce error propagation.
This explicitly targets a known multi-step reasoning failure mode: compounding errors when later steps rely on flawed intermediate claims.
(4) OREAL-H: RL credit assignment using lemma dependency graphs (Section 4.3.1; Eq. (5)–(6); Figure 3)
Novelty: assigns learning signal to rounds that make “verifiable progress” by estimating lemma values in a graph aggregated over rollouts, rather than relying only on final correctness.
Significance: the paper frames this as reducing variance and avoiding wasted updates on unproductive long trajectories.
(5) Conjugate reward model for noisy process verification (Section 4.3.2; Eq. (7)–(8))
Novelty: replaces raw k/n PV pass rates with a Bayesian dominance probability against an invalid baseline.
Significance: intended to stabilize learning when the verifier is stochastic and imperfect, preserving strong gradients for high-confidence passes while suppressing “lucky” passes.

5. Experimental Analysis

Evaluation methodology (Section 5.1; Appendix D)

Benchmarks used
AIME2025 (answer-focused)
HMMT2025 Feb (answer-focused)
IMO2025 (non-geometry only)
CNMO2025 (non-geometry only)
CMO2025 (official participation, human-graded)
Metrics
For AIME2025, HMMT2025, CNMO2025: pass@1 (with 16 independent rollouts and “unbiased pass@1” per [5]) (Section 5.1).
For IMO2025: pass@4 (Section 5.1; Table 1).
For proof-oriented datasets (CNMO2025, IMO2025): an LLM-based fine-grained grading scheme inspired by MathArena, with modifications to better bind sub-propositions to proof obligations (Appendix D).
- Ensemble evaluation: each solution is graded across N = 8 independent runs; final score is the mean across runs (Appendix D).
Baselines (Section 5.1; Table 1)
Gemini2.5-pro, o3-high, Grok4, GPT-OSS-120B, DeepSeek-R1-0528, Qwen3-235B-A22B.
For some baseline scores on AIME/HMMT, the paper reuses numbers from technical reports / MathArena (Table 1 caption; Section 5.1).
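For the pass@1 and pass@4 metrics listed above, a common choice is the unbiased pass@k estimator 1 − C(n−c, k) / C(n, k), computed from n rollouts of which c are correct. The sketch below assumes this is what the “unbiased pass@1” citation refers to; the excerpt does not reproduce the formula, so treat the match as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct).

    n: rollouts drawn, c: rollouts judged correct, k: sample budget (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 16 rollouts and 8 correct, pass@1 reduces to the plain success rate.
score = pass_at_k(n=16, c=8, k=1)  # 0.5
```

For k = 1 the estimator is simply c / n, so averaging over 16 rollouts mainly serves to reduce the variance of the reported score.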

Main quantitative results (Table 1; Table 3)

Table 1 (overall benchmark comparison)
- Intern-S1-MO:
  - HMMT2025: 95.0
  - AIME2025: 96.6
  - CNMO2025: 232.4
  - IMO2025 (non-geometry, pass@4): 26
- Intern-S1-mini-MO:
  - HMMT2025: 79.2
  - AIME2025: 87.3
  - CNMO2025: 176.3
  - IMO2025: 17
- Best baseline numbers shown in Table 1 include:
  - HMMT2025: baseline best is Grok4 at 92.5, while Intern-S1-MO is 95.0.
  - AIME2025: baseline best is GPT-OSS-120B at 92.5 (Table 1), while Intern-S1-MO is 96.6.
  - CNMO2025: baseline best is Gemini2.5-pro at 157.5, while Intern-S1-MO is 232.4.
  - IMO2025: baselines shown range up to 14 (Gemini2.5-pro and Qwen3-235B-A22B), while Intern-S1-MO is 26.

Table 3 (official CMO2025 participation)
- Total: 102 / 126
- Per problem: P1=21, P2=21, P3=9, P4=21, P5=21, P6=9.
- The paper states this exceeds a gold threshold of 78 points (Section 5.4).

Ablation studies (Table 2)

The ablation isolates additive contributions on HMMT2025, AIME2025, CNMO2025:
Single-round with Agents: 70.8 / 81.9 / 178.0
+ Multi-round Reasoning: 85.4 / 91.0 / 201.7
+ Theorem Verifier: 86.3 / 93.3 / 203.0
+ Process Verifier: 89.1 / 94.0 / 215.2
+ OREAL-H: 95.0 / 96.6 / 232.4
Interpretation grounded in Table 2:
The largest single jump comes from adding multi-round reasoning (e.g., CNMO2025: 178.0 → 201.7).
Verification components add incremental improvements.
OREAL-H adds a substantial final gain (e.g., CNMO2025: 215.2 → 232.4).
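The step-by-step gains behind this interpretation can be recomputed directly from the Table 2 rows quoted above. The snippet only does the subtraction; the variable and function names are illustrative.

```python
BENCHMARKS = ("HMMT2025", "AIME2025", "CNMO2025")

# Rows as quoted from Table 2, in cumulative order.
ABLATION = [
    ("Single-round with Agents", (70.8, 81.9, 178.0)),
    ("+ Multi-round Reasoning", (85.4, 91.0, 201.7)),
    ("+ Theorem Verifier", (86.3, 93.3, 203.0)),
    ("+ Process Verifier", (89.1, 94.0, 215.2)),
    ("+ OREAL-H", (95.0, 96.6, 232.4)),
]


def step_gains(rows):
    """Gain of each row over the previous one, per benchmark, rounded to 0.1."""
    gains = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        gains.append((name, tuple(round(c - p, 1) for p, c in zip(prev, cur))))
    return gains


for name, deltas in step_gains(ABLATION):
    print(name, dict(zip(BENCHMARKS, deltas)))
```

Multi-round reasoning contributes +14.6 / +9.1 / +23.7, the largest step on all three benchmarks, and OREAL-H contributes +5.9 / +2.6 / +17.2. The two verifiers split differently across benchmarks: the theorem verifier helps AIME2025 more (+2.3 versus +0.7), while the process verifier dominates on CNMO2025 (+12.2 versus +1.3).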

Do the experiments support the claims?

The provided results convincingly support the narrower claim that:
among the models compared in Table 1, Intern-S1-MO achieves the best reported scores on these benchmarks, and
each architectural/training component contributes in the direction expected (Table 2).
Two caveats that remain based on the excerpt:
For some baselines on AIME/HMMT, scores are imported from other reports/MathArena rather than re-run under identical conditions (Table 1 caption), which can complicate strict apples-to-apples comparisons.
The excerpt does not provide confidence intervals or statistical tests; variability is mentioned for IMO2025 due to only 5 problems (Section 5.3).

6. Limitations and Trade-offs

High inference cost / test-time scaling
The system is explicitly designed to scale test-time compute (Figure 1; Appendix B.1; Section 5.4), including up to 512k tokens per problem and large parallel search budgets (e.g., 256-shot in CMO2025).
This is a trade-off: performance gains may depend on substantial inference-time resources.
Verifier dependence and noise
The approach relies on theorem verification and process verification to prevent error propagation (Section 3).
Even with conjugate reward denoising, PV feedback is acknowledged as noisy (Section 4.3.2), and verifier errors could still steer search/training.
Missing architectural reproducibility details (from the provided excerpt)
The excerpt does not include core model architecture parameters (layers, dimensions, tokenizer, pretraining compute/tokens, hardware), limiting full reproducibility from this text alone.
Scope restrictions in evaluation
For IMO2025 and CNMO2025, the evaluation explicitly excludes geometry problems (Section 5.1).
Therefore, conclusions in the excerpt primarily support non-geometry Olympiad problem solving.
Potential memory contamination / lemma quality management
The paper’s own motivation emphasizes error propagation; despite lemma verification, storing incorrect or ambiguous lemmas could still degrade later rounds.
The excerpt does not specify acceptance thresholds or memory pruning strategies beyond verification/voting, so it is unclear how robust the lemma library is under adversarial or highly ambiguous intermediate claims.

7. Implications and Future Directions

How this changes the landscape
The work suggests that pushing beyond context limits for Olympiad math may be less about ever-longer single-pass contexts and more about agentic decomposition + durable, verified intermediate memory (Figure 2; Conclusion).
It also frames process verification not only as an evaluator but as a training signal and revision guide, integrating verification into the reasoning loop (Section 3; Section 4).
Follow-up research enabled
Better lemma memory management: selection, deduplication, contradiction detection, and retrieval policies are natural next steps, since lemma quality is central.
Stronger graph-based credit assignment: the lemma dependency graph idea (Figure 3) could be expanded with richer edge definitions or more principled value backups, but the excerpt does not detail these choices.
Improving verifier calibration: since PV noise is important enough to motivate conjugate rewards, improved calibration or joint training of verifier + policy could further stabilize learning.
Practical applications / downstream use
A system like this can be applied where solutions require long proofs with intermediate milestones: competition math, proof drafting, and iterative refinement workflows (Conclusion’s “math research” motivation).
The paper’s official competition participation (CMO2025) indicates an intended use case: real-world timed settings with human-style proof submissions (Section 5.4).
Repro/Integration Guidance (based on the provided excerpt)
Prefer this agentic approach over single-pass prompting when:
- the problem requires building many intermediate claims and revisiting them across multiple attempts (IMO-style), and
- you can afford multi-round inference budgets and verification overhead (Appendix B.1; Section 5.4).
If you only need final-answer tasks (AIME/HMMT style), the gains may be smaller relative to strong baselines (paper’s own discussion in Section 5.2), so the added complexity might not be cost-effective.
Training-wise, the excerpt indicates a staged recipe:
1. generate trajectories with a variant of Intern-S1 and filter for structured lemma summaries (Section 5.1; Section 4.2),
2. behavioral cloning cold start (Eq. (4)),
3. online RL with PV-based rewards + lemma-graph advantages (OREAL-H) (Section 4.3; Algorithm 1), using reported hyperparameters (Appendix B.2: AdamW, LR 5e−7, cosine schedule, β=0.01, batch=64 questions, 16 rollouts/question, max 65,536 tokens/rollout).