Skip to content

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

ArXiv: 2512.10739

🎯 Pitch

The paper introduces Intern-S1-MO, a multi-agent, multi-round reasoning system that replaces single-pass chain-of-thought with hierarchical lemma-based memory, enabling LRMs to explore vastly longer proof trajectories by storing and reusing verified intermediate lemmas. It also presents OREAL-H, an RL training framework that leverages process-verifier feedback and lemma dependency graphs for improved credit assignment across rounds—together delivering state-of-the-art results on Olympiad-level benchmarks (e.g., 26/35 on non-geometry IMO2025) and demonstrating a practical path beyond fixed context limits for complex mathematical reasoning.


1. Executive Summary (2-3 sentences)

Intern-S1-MO is a long-horizon mathematical reasoning agent that targets Olympiad-level problems by replacing a single long chain-of-thought with multi-round, hierarchical reasoning plus a compact “lemma memory” that stores reusable intermediate results (Figure 2). To train this multi-round behavior, the paper introduces OREAL-H, a reinforcement learning framework that uses process-verifier feedback and a lemma dependency graph for better credit assignment across rounds (Eq. (5)–(8), Algorithm 1). On the paper’s evaluations, the system reaches 26/35 points on non-geometry IMO2025 (pass@4) and 102/126 on CMO2025 under human grading (Table 3), while also improving scores on AIME2025/HMMT2025/CNMO2025 (Table 1).

2. Context and Motivation

  • Problem / gap addressed
  • Large Reasoning Models (LRMs) improve math performance using CoT (Chain-of-Thought) and RL with verifiable rewards (RLVR), but their performance is described as heavily dependent on long reasoning context length.
  • For very hard problems (e.g., IMO), the reasoning complexity is framed as exceeding what an LRM can explore “in a single round,” even if the model has a large context window.

  • Why this matters

  • The paper argues that as problem difficulty increases, both human thinking time and model token consumption grow sharply, creating a mismatch with practical context limits (Figure 1(a)).
  • It positions the central bottleneck as: context length is limited (often 64k–128k tokens) while IMO-style reasoning can require substantially more exploration and revisiting of partial progress (Introduction; Figure 1).

  • Prior approaches and where they fall short (as positioned in the paper)

  • Multi-round interaction / parallel decoding / reflection via prompting: these broaden search or add self-correction, but are described as still operating within a single reasoning cycle that does not cumulatively build on earlier explorations across rounds (Introduction; Related Work §2.1).
  • Formal-language proof/search systems: they can store structured intermediate results, but are described as requiring extensive state traversal and proof checking (high overhead) and incurring translation costs from informal to formal logic (Introduction).
  • Proprietary LRMs: reported strong IMO2025 performance, but the community lacks access to systematic methods and models (Introduction).

  • Paper’s positioning

  • The paper presents a systematic agent architecture + training pipeline designed specifically for multi-round, lemma-driven exploration, aiming to “break through context constraints” on IMO-level problems (Introduction; Figure 2).
  • It also positions OREAL-H as addressing RL limitations for agents where rewards are delayed and process-verifier feedback is noisy (Section 4).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a math-problem-solving agent built around a base LRM, expanded into a multi-agent workflow that repeatedly reasons, summarizes intermediate results into lemmas, and verifies them.
  • It solves long and difficult proofs by turning one long reasoning trace into multiple shorter rounds connected by a compact lemma memory, plus a training method (OREAL-H) that learns from the agent’s own explored trajectories.

3.2 Big-picture architecture (diagram in words)

  • Input: a math problem (and, from round 2 onward, a lemma library / memory).
  • Reasoner agent: generates a solution attempt or partial progress for the current round.
  • Summarizer agent: compresses the round’s reasoning trace into a set of candidate lemmas.
  • Verifier agent(s):
  • A theorem verifier assigns confidence to intermediate lemmas via multiple verification samples.
  • A process verifier (from OPV) evaluates final-solution rigor and provides feedback for revision.
  • Memory system: stores verified lemmas; feeds them back into the next reasoning round (Figure 2).
  • Final refinement loop: iteratively revises the final solution using process-verifier feedback until it passes or a max loop count is reached (Figure 2).

3.3 Roadmap for the deep dive

  • I will explain:
  • The multi-round agent loop and why lemma memory matters for long-horizon reasoning (Section 3; Figure 2).
  • The two verification layers (lemma-level and proof-process-level) and how they mitigate error propagation (Section 3).
  • The Hierarchical MDP formulation used for training (Section 4.1; Eq. (1)–(3)).
  • OREAL-H’s two key training ideas: lemma dependency graphs for credit assignment and conjugate reward modeling for noisy process verification (Section 4.3; Eq. (5)–(8), Algorithm 1).
  • The inference/training budgets and hyperparameters the paper reports (Appendix B; Section 5).

3.4 Detailed, sentence-based technical breakdown

This is an empirical agent/system paper with an algorithmic RL component, and the core idea is to extend effective reasoning depth by splitting problem solving into multiple rounds connected by a verified lemma memory, then training the policy with RL that uses process-level feedback plus a graph-based progress signal.

3.4.1 Multi-round hierarchical reasoning with lemma memory (Figure 2; Section 3)

  • The agent runs in discrete rounds rather than one long generation, because long IMO-style solutions require exploration, backtracking, and reuse of partial progress that may not fit in one context window (Introduction; Figure 1(a)).
  • In each round, the reasoner produces either:
  • a complete solution attempt, or
  • a partial but rigorous set of intermediate results (lemmas), explicitly encouraged to avoid “premature conclusion bias” where the model rushes to a final answer when budget runs out (Section 3, “Decomposing Sub-Problems for Lemma Search”; Appendix A.1 prompt stresses partial solutions and rigor).
  • After the reasoner finishes a round, the summarizer compresses the long trace into a structured set of sub-lemmas (Section 3, “Summarizing Exploration for Memory Maintenance”).
  • The motivation is that raw traces contain redundant trial-and-error; subsequent rounds mostly benefit from the validated intermediate claims, not the full verbose exploration.
  • These lemmas are stored in a lemma library (memory system) and then fed back into the next reasoning round alongside the original question (Figure 2).
  • This creates a cumulative “scratchpad” effect: instead of re-deriving the same intermediate results, later rounds can start from previously established lemmas.

System/data pipeline diagram in words (explicit flow): 1. Round 1 starts with the Question alone as input. 2. The Reasoner agent generates a long-chain attempt (possibly partial). 3. The Summarizer agent reads the reasoning trace and outputs a list of candidate lemmas to store. 4. The Lemma verifier checks each lemma (via parallel sampling) and assigns a confidence score; only sufficiently trusted lemmas are added to memory (Section 3, “Verifying Theorems…”). 5. Round 2..(n−1) repeat steps 2–4, but now the input to the reasoner is Question + Lemma Library (Figure 2). 6. In the final round, the reasoner outputs a full solution draft. 7. The Process verifier (OPV) evaluates the final proof and produces natural-language feedback identifying flawed steps; the agent enters a revision loop to fix issues until verification passes or the loop limit is reached (Figure 2; Section 3, “Verifying Process for Final Proof Completion”).

3.4.2 Verification design: lemma-level and process-level (Section 3)

  • Lemma verification to reduce error propagation
  • The paper highlights a failure mode where an incorrect intermediate conclusion can mislead future reasoning (“circular reasoning or invalid proofs”), so it adds verification at the lemma level because checking a lemma is framed as easier than checking the entire solution (Section 3).
  • Mechanism: for each lemma, the theorem verifier runs n parallel verifications and uses the proportion of “correctly identified” outcomes as a confidence score (Section 3).
  • The paper’s aim is to reduce false positives/negatives by voting-style aggregation, though the excerpt does not specify the exact verifier model architecture or thresholding policy used to accept/reject a lemma.

  • Process verification for final proof completion

  • For end-to-end correctness and training feedback, the paper uses the OPV process verifier, which provides step-level feedback in natural language (Section 3; Section 4.1 reward discussion).
  • The paper cites OPV as having F1 > 85% on ProcessBench (Section 3), and uses it in two roles:
    1. Test-time scaling: aggregate verification across multiple runs for robustness.
    2. Training signal / revision guidance: provide feedback for iterative revision and RL.

3.4.3 Training formulation: Hierarchical MDP + token-level policy gradient (Section 4.1; Eq. (1)–(3))

  • The paper models the overall agent as a Hierarchical Markov Decision Process (MDP):
  • State s_t includes problem context, reasoning trace, and verification feedback (Section 4.1).
  • High-level “meta-actions” u_t include things like “extract lemmas,” “invoke verification,” and “commit answer” (Section 4.1).
  • Low-level actions are the generated token sequences v_t = (v_{t,1}, …, v_{t,T_t}) from the language model policy π^L_θ(·|s_t) (Section 4.1).
  • The optimization objective is to maximize expected final reward R (Eq. (1)):
  • J(θ, φ) = E_{π^H_φ, π^L_θ}[R].
  • Here, π^H_φ denotes the high-level policy (not fully detailed in the excerpt), and π^L_θ is the low-level token policy.
  • The advantage is computed using a value function V(s_t) updated by temporal difference (Eq. (2)):
  • V(s_t) ← E[r_t + γ V(s_{t+1})].
  • The token-level gradient aggregates log-likelihood gradients within each round weighted by the round advantage (Eq. (3)):
  • ∇_θ J = E[ Σ_t A_t · Σ_τ ∇_θ log π^L_θ(v_{t,τ} | s_t, v_{t,<τ}) ].

Important constraint from the excerpt: the paper describes both a high-level and low-level policy, but the provided text does not give implementational details for π^H_φ (parameterization, action encoding, or how it interacts with the multi-agent modules), so any deeper description would be speculative.

3.4.4 Cold start: behavioral cloning on successful structured trajectories (Section 4.2; Eq. (4))

  • Before online RL, the policy is initialized via behavioral cloning on filtered transitions where the output yields a well-formed lemma summary (syntactically valid, non-empty, logically segmented) (Section 4.2).
  • The pretraining loss is standard token negative log-likelihood over these filtered trajectories (Eq. (4)):
  • L_RFT(θ) = - E_{(s_t, v_t) ~ D_init} [ Σ_τ log π^L_θ(v_{t,τ} | s_t, v_{t,<τ}) ].
  • The dataset D_init is continuously augmented with outcome-filtered QA pairs “without previous thinking,” and the paper claims this improves discovery of positive trajectories during RL (Section 4.2).

3.4.5 OREAL-H: RL adaptations for multi-round agents (Section 4.3; Eq. (5)–(8), Algorithm 1)

The paper starts from OREAL (Outcome Reward RL) and introduces two key modifications to work with: (i) delayed credit assignment across rounds, and (ii) noisy continuous process-verifier rewards.

(A) Progress-conditioned advantage via lemma dependency graphs (Section 4.3.1; Eq. (5)–(6)) - Core idea: build a lemma dependency graph by aggregating lemmas produced across multiple rollouts for the same problem, then propagate success value backward from lemmas that lead to correct solutions. - The paper defines a lemma value recursively (Eq. (5)): - Plain language paraphrase: a lemma is valuable if it tends to lead to successor lemmas that eventually lead to a successful final solution. - Notation: v(l) = E_{l' ∈ Succ(l)}[ v(l') ], where Succ(l) are lemmas derived from l. - For a round t producing candidate lemma set L_t, the state value is taken optimistically as: - V(s_t) = max_{l ∈ L_t} v(l) (Section 4.3.1). - The advantage becomes a temporal-difference style improvement in best reachable lemma value (Eq. (6)): - A_t = r_t + γ max_{l' ∈ L_{t+1}} v(l') − max_{l ∈ L_t} v(l). - Rounds that produce no new lemmas are masked (advantage set to 0), so training focuses updates on rounds that yield “verifiable advances” (Section 4.3.1; Algorithm 1 lines 16–23). - The paper claims this functions like a cheaper surrogate to MCTS-style value estimation, without extensive search overhead (Section 4.3.1; Figure 3 shows an example lemma graph).

(B) Conjugate reward modeling for noisy process verification (Section 4.3.2; Eq. (7)–(8)) - The process verifier (PV) is stochastic: passing k out of n checks is not treated as a clean probability of correctness (Section 4.3.2). - The paper models latent solution quality p ∈ [0,1] with a Beta prior Beta(1,1) and updates after observing k successes out of n PV trials: - Posterior: p | (k,n) ~ Beta(k+1, n-k+1) (Eq. (7)). - Instead of using k/n directly, it defines reward as the probability the solution’s latent quality exceeds a “completely invalid” baseline that fails all checks: - p1 ~ Beta(k+1, n-k+1) (current solution), p0 ~ Beta(1, n+1) (baseline), - reward R(k,n) = P(p1 > p0) (Eq. (8)). - In practice, the paper fixes n = 4 PV trials (Section 4.3.2). - It reports R(4,4) ≈ 5.5 and baseline R(0,4) = 0, describing this as “99.5% dominance probability over the baseline,” with smoothly interpolated rewards for k = 1,2,3 (Section 4.3.2). - Note: the excerpt’s numeric mapping (e.g., reward ≈ 5.5) implies a scaled reward rather than a pure probability in [0,1]; the text mixes “probability” language with a >1 numeric reward. The provided excerpt does not fully clarify this scaling, so the safest interpretation is: they compute a dominance probability and then use a scaled form as the RL reward.

(C) End-to-end OREAL-H loop (Algorithm 1) 1. Sample K multi-round trajectories per problem with the current policy. 2. Run PV to get k/n outcomes and compute conjugate reward. 3. Build lemma dependency graph across trajectories. 4. Backpropagate lemma values and compute progress-conditioned advantages. 5. Optimize the policy with an OREAL-style loss (Algorithm 1 line 27; details of the exact loss are referenced to [18] but not shown in the excerpt).

3.4.6 Budgets, inference scaling, and hyperparameters (Figure 1; Appendix B; Section 5.4)

Inference / test-time scaling budgets - Figure 1(b) and the Introduction describe that Intern-S1-MO can use about 512k tokens to solve a single problem, motivated as ~8× extension over a 64k context constraint (Introduction; Conclusion). - Default evaluation budgets (Appendix B.1): - Max 8 inference rounds for reasoner and summarizer. - Theorem verifier: 4 parallel verifications per lemma. - Final iterative revision (process verifier loop): max 8 rounds. - Max length of output for reasoner and summarizer: 64k tokens per output.

Training hyperparameters (Appendix B.2) - Batch: 64 questions per batch. - Rollouts: 16 rollouts per question. - Max length per rollout trajectory: 65,536 tokens. - Filtering: discard questions with overall pass rate 0 or 1. - Optimizer: AdamW. - Learning rate: 5e−7, cosine annealing schedule decaying to 1/5 of initial LR. - KL coefficient β: 0.01.

Official CMO2025 participation budget (Section 5.4) - For each problem: - 256-shot parallel search over up to 12 rounds. - Intermediate lemmas: lemma verifier provides multiple rounds of 8-shot feedback. - Candidate solution refinement: 8-shot refinement procedure comprising 24 rounds, where OPV flags issues and the policy revises accordingly.

Model architecture details (missing in provided excerpt) - The excerpt does not specify core LRM architecture/configuration such as: - number of layers, hidden size, attention heads, tokenizer, context window during pretraining, hardware, total training tokens, or compute budget. - Because the user asked for a deep reading grounded only in provided text, these details cannot be supplied without guessing.

4. Key Insights and Innovations

  • (1) Lemma-based compact memory enabling multi-round reasoning (Figure 2; Section 3)
  • Novelty relative to single-pass CoT: instead of relying on one extended context, the system repeatedly compresses reasoning into reusable lemmas and re-injects them as memory in later rounds.
  • Significance: enables “unlimited exploration capability” in principle by keeping only essential validated statements, letting the agent explore lemma-rich solution spaces beyond a single context limit.

  • (2) Explicit separation of roles into reasoner / summarizer / verifier agents (Figure 2)

  • Unlike prompt-only self-reflection, the paper operationalizes different cognitive tasks (explore vs compress vs check) as separate agent steps.
  • This is positioned as a systematic structure rather than ad-hoc prompting, and ablations suggest each part contributes (Table 2).

  • (3) Lemma verification via parallel sampling confidence scores (Section 3)

  • The paper treats lemma verification as easier than full-proof verification, using multiple verifier runs and a confidence score to reduce error propagation.
  • This explicitly targets a known multi-step reasoning failure mode: compounding errors when later steps rely on flawed intermediate claims.

  • (4) OREAL-H: RL credit assignment using lemma dependency graphs (Section 4.3.1; Eq. (5)–(6); Figure 3)

  • Novelty: assigns learning signal to rounds that make “verifiable progress” by estimating lemma values in a graph aggregated over rollouts, rather than relying only on final correctness.
  • Significance: the paper frames this as reducing variance and avoiding wasted updates on unproductive long trajectories.

  • (5) Conjugate reward model for noisy process verification (Section 4.3.2; Eq. (7)–(8))

  • Novelty: replaces raw k/n PV pass rates with a Bayesian dominance probability against an invalid baseline.
  • Significance: intended to stabilize learning when the verifier is stochastic and imperfect, preserving strong gradients for high-confidence passes while suppressing “lucky” passes.

5. Experimental Analysis

Evaluation methodology (Section 5.1; Appendix D)

  • Benchmarks used
  • AIME2025 (answer-focused)
  • HMMT2025 Feb (answer-focused)
  • IMO2025 (non-geometry only)
  • CNMO2025 (non-geometry only)
  • CMO2025 (official participation, human-graded)

  • Metrics

  • For AIME2025, HMMT2025, CNMO2025: pass@1 (with 16 independent rollouts and “unbiased pass@1” per [5]) (Section 5.1).
  • For IMO2025: pass@4 (Section 5.1; Table 1).
  • For proof-oriented datasets (CNMO2025, IMO2025): an LLM-based fine-grained grading scheme inspired by MathArena, with modifications to better bind sub-propositions to proof obligations (Appendix D).

    • Ensemble evaluation: each solution is graded across N = 8 independent runs; final score is the mean across runs (Appendix D).
  • Baselines (Section 5.1; Table 1)

  • Gemini2.5-pro, o3-high, Grok4, GPT-OSS-120B, DeepSeek-R1-0528, Qwen3-235B-A22B.
  • For some baseline scores on AIME/HMMT, the paper reuses numbers from technical reports / MathArena (Table 1 caption; Section 5.1).

Main quantitative results (Table 1; Table 3)

Table 1 (overall benchmark comparison) - Intern-S1-MO: - HMMT2025: 95.0 - AIME2025: 96.6 - CNMO2025: 232.4 - IMO2025 (non-geometry, pass@4): 26 - Intern-S1-mini-MO: - HMMT2025: 79.2 - AIME2025: 87.3 - CNMO2025: 176.3 - IMO2025: 17 - Best baseline numbers shown in Table 1 include: - HMMT2025: baseline best is Grok4 at 92.5, while Intern-S1-MO is 95.0. - AIME2025: baseline best is GPT-OSS-120B at 92.5 (Table 1), while Intern-S1-MO is 96.6. - CNMO2025: baseline best is Gemini2.5-pro at 157.5, while Intern-S1-MO is 232.4. - IMO2025: baselines shown range up to 14 (Gemini2.5-pro and Qwen3-235B-A22B), while Intern-S1-MO is 26.

Table 3 (official CMO2025 participation) - Total: 102 / 126 - Per problem: P1=21, P2=21, P3=9, P4=21, P5=21, P6=9. - The paper states this exceeds a gold threshold of 78 points (Section 5.4).

Ablation studies (Table 2)

  • The ablation isolates additive contributions on HMMT2025, AIME2025, CNMO2025:
  • Single-round with Agents: 70.8 / 81.9 / 178.0
  • + Multi-round Reasoning: 85.4 / 91.0 / 201.7
  • + Theorem Verifier: 86.3 / 93.3 / 203.0
  • + Process Verifier: 89.1 / 94.0 / 215.2
  • + OReal-H: 95.0 / 96.6 / 232.4
  • Interpretation grounded in Table 2:
  • The largest single jump comes from adding multi-round reasoning (e.g., CNMO2025: 178.0 → 201.7).
  • Verification components add incremental improvements.
  • OREAL-H adds a substantial final gain (e.g., CNMO2025: 215.2 → 232.4).

Do the experiments support the claims?

  • The provided results convincingly support the narrower claim that:
  • among the models compared in Table 1, Intern-S1-MO achieves the best reported scores on these benchmarks, and
  • each architectural/training component contributes in the direction expected (Table 2).
  • Two caveats that remain based on the excerpt:
  • For some baselines on AIME/HMMT, scores are imported from other reports/MathArena rather than re-run under identical conditions (Table 1 caption), which can complicate strict apples-to-apples comparisons.
  • The excerpt does not provide confidence intervals or statistical tests; variability is mentioned for IMO2025 due to only 5 problems (Section 5.3).

6. Limitations and Trade-offs

  • High inference cost / test-time scaling
  • The system is explicitly designed to scale test-time compute (Figure 1; Appendix B.1; Section 5.4), including up to 512k tokens per problem and large parallel search budgets (e.g., 256-shot in CMO2025).
  • This is a trade-off: performance gains may depend on substantial inference-time resources.

  • Verifier dependence and noise

  • The approach relies on theorem verification and process verification to prevent error propagation (Section 3).
  • Even with conjugate reward denoising, PV feedback is acknowledged as noisy (Section 4.3.2), and verifier errors could still steer search/training.

  • Missing architectural reproducibility details (from the provided excerpt)

  • The excerpt does not include core model architecture parameters (layers, dimensions, tokenizer, pretraining compute/tokens, hardware), limiting full reproducibility from this text alone.

  • Scope restrictions in evaluation

  • For IMO2025 and CNMO2025, the evaluation explicitly excludes geometry problems (Section 5.1).
  • Therefore, conclusions in the excerpt primarily support non-geometry Olympiad problem solving.

  • Potential memory contamination / lemma quality management

  • The paper’s own motivation emphasizes error propagation; despite lemma verification, storing incorrect or ambiguous lemmas could still degrade later rounds.
  • The excerpt does not specify acceptance thresholds or memory pruning strategies beyond verification/voting, so it is unclear how robust the lemma library is under adversarial or highly ambiguous intermediate claims.

7. Implications and Future Directions

  • How this changes the landscape
  • The work suggests that pushing beyond context limits for Olympiad math may be less about ever-longer single-pass contexts and more about agentic decomposition + durable, verified intermediate memory (Figure 2; Conclusion).
  • It also frames process verification not only as an evaluator but as a training signal and revision guide, integrating verification into the reasoning loop (Section 3; Section 4).

  • Follow-up research enabled

  • Better lemma memory management: selection, deduplication, contradiction detection, and retrieval policies are natural next steps, since lemma quality is central.
  • Stronger graph-based credit assignment: the lemma dependency graph idea (Figure 3) could be expanded with richer edge definitions or more principled value backups, but the excerpt does not detail these choices.
  • Improving verifier calibration: since PV noise is important enough to motivate conjugate rewards, improved calibration or joint training of verifier + policy could further stabilize learning.

  • Practical applications / downstream use

  • A system like this can be applied where solutions require long proofs with intermediate milestones: competition math, proof drafting, and iterative refinement workflows (Conclusion’s “math research” motivation).
  • The paper’s official competition participation (CMO2025) indicates an intended use case: real-world timed settings with human-style proof submissions (Section 5.4).

  • Repro/Integration Guidance (based on the provided excerpt)

  • Prefer this agentic approach over single-pass prompting when:
    • the problem requires building many intermediate claims and revisiting them across multiple attempts (IMO-style), and
    • you can afford multi-round inference budgets and verification overhead (Appendix B.1; Section 5.4).
  • If you only need final-answer tasks (AIME/HMMT style), the gains may be smaller relative to strong baselines (paper’s own discussion in Section 5.2), so the added complexity might not be cost-effective.
  • Training-wise, the excerpt indicates a staged recipe:
    1. generate trajectories (variant of Intern-S1) and filter for structured lemma summaries (Section 5.1; Section 4.2),
    2. behavioral cloning cold start (Eq. (4)),
    3. online RL with PV-based rewards + lemma-graph advantages (OREAL-H) (Section 4.3; Algorithm 1), using reported hyperparameters (Appendix B.2: AdamW, LR 5e−7, cosine schedule, β=0.01, batch=64 questions, 16 rollouts/question, max 65,536 tokens/rollout).