Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving¶
ArXiv: 2512.10739
🎯 Pitch¶
The paper introduces Intern-S1-MO, a multi-agent, multi-round reasoning system that replaces single-pass chain-of-thought with hierarchical lemma-based memory, enabling LRMs to explore vastly longer proof trajectories by storing and reusing verified intermediate lemmas. It also presents OREAL-H, an RL training framework that leverages process-verifier feedback and lemma dependency graphs for improved credit assignment across rounds—together delivering state-of-the-art results on Olympiad-level benchmarks (e.g., 26/35 on non-geometry IMO2025) and demonstrating a practical path beyond fixed context limits for complex mathematical reasoning.
1. Executive Summary (2-3 sentences)¶
Intern-S1-MO is a long-horizon mathematical reasoning agent that targets Olympiad-level problems by replacing a single long chain-of-thought with multi-round, hierarchical reasoning plus a compact “lemma memory” that stores reusable intermediate results (Figure 2). To train this multi-round behavior, the paper introduces OREAL-H, a reinforcement learning framework that uses process-verifier feedback and a lemma dependency graph for better credit assignment across rounds (Eq. (5)–(8), Algorithm 1). On the paper’s evaluations, the system reaches 26/35 points on non-geometry IMO2025 (pass@4) and 102/126 on CMO2025 under human grading (Table 3), while also improving scores on AIME2025/HMMT2025/CNMO2025 (Table 1).
2. Context and Motivation¶
- Problem / gap addressed
- Large Reasoning Models (
LRMs) improve math performance usingCoT(Chain-of-Thought) and RL with verifiable rewards (RLVR), but their performance is described as heavily dependent on long reasoning context length. -
For very hard problems (e.g., IMO), the reasoning complexity is framed as exceeding what an LRM can explore “in a single round,” even if the model has a large context window.
-
Why this matters
- The paper argues that as problem difficulty increases, both human thinking time and model token consumption grow sharply, creating a mismatch with practical context limits (Figure 1(a)).
-
It positions the central bottleneck as: context length is limited (often 64k–128k tokens) while IMO-style reasoning can require substantially more exploration and revisiting of partial progress (Introduction; Figure 1).
-
Prior approaches and where they fall short (as positioned in the paper)
- Multi-round interaction / parallel decoding / reflection via prompting: these broaden search or add self-correction, but are described as still operating within a single reasoning cycle that does not cumulatively build on earlier explorations across rounds (Introduction; Related Work §2.1).
- Formal-language proof/search systems: they can store structured intermediate results, but are described as requiring extensive state traversal and proof checking (high overhead) and incurring translation costs from informal to formal logic (Introduction).
-
Proprietary LRMs: reported strong IMO2025 performance, but the community lacks access to systematic methods and models (Introduction).
-
Paper’s positioning
- The paper presents a systematic agent architecture + training pipeline designed specifically for multi-round, lemma-driven exploration, aiming to “break through context constraints” on IMO-level problems (Introduction; Figure 2).
- It also positions
OREAL-Has addressing RL limitations for agents where rewards are delayed and process-verifier feedback is noisy (Section 4).
3. Technical Approach¶
3.1 Reader orientation (approachable technical breakdown)¶
- The system is a math-problem-solving agent built around a base LRM, expanded into a multi-agent workflow that repeatedly reasons, summarizes intermediate results into lemmas, and verifies them.
- It solves long and difficult proofs by turning one long reasoning trace into multiple shorter rounds connected by a compact lemma memory, plus a training method (
OREAL-H) that learns from the agent’s own explored trajectories.
3.2 Big-picture architecture (diagram in words)¶
- Input: a math problem (and, from round 2 onward, a lemma library / memory).
- Reasoner agent: generates a solution attempt or partial progress for the current round.
- Summarizer agent: compresses the round’s reasoning trace into a set of candidate lemmas.
- Verifier agent(s):
- A theorem verifier assigns confidence to intermediate lemmas via multiple verification samples.
- A process verifier (from
OPV) evaluates final-solution rigor and provides feedback for revision. - Memory system: stores verified lemmas; feeds them back into the next reasoning round (Figure 2).
- Final refinement loop: iteratively revises the final solution using process-verifier feedback until it passes or a max loop count is reached (Figure 2).
3.3 Roadmap for the deep dive¶
- I will explain:
- The multi-round agent loop and why lemma memory matters for long-horizon reasoning (Section 3; Figure 2).
- The two verification layers (lemma-level and proof-process-level) and how they mitigate error propagation (Section 3).
- The Hierarchical MDP formulation used for training (Section 4.1; Eq. (1)–(3)).
OREAL-H’s two key training ideas: lemma dependency graphs for credit assignment and conjugate reward modeling for noisy process verification (Section 4.3; Eq. (5)–(8), Algorithm 1).- The inference/training budgets and hyperparameters the paper reports (Appendix B; Section 5).
3.4 Detailed, sentence-based technical breakdown¶
This is an empirical agent/system paper with an algorithmic RL component, and the core idea is to extend effective reasoning depth by splitting problem solving into multiple rounds connected by a verified lemma memory, then training the policy with RL that uses process-level feedback plus a graph-based progress signal.
3.4.1 Multi-round hierarchical reasoning with lemma memory (Figure 2; Section 3)¶
- The agent runs in discrete rounds rather than one long generation, because long IMO-style solutions require exploration, backtracking, and reuse of partial progress that may not fit in one context window (Introduction; Figure 1(a)).
- In each round, the reasoner produces either:
- a complete solution attempt, or
- a partial but rigorous set of intermediate results (lemmas), explicitly encouraged to avoid “premature conclusion bias” where the model rushes to a final answer when budget runs out (Section 3, “Decomposing Sub-Problems for Lemma Search”; Appendix A.1 prompt stresses partial solutions and rigor).
- After the reasoner finishes a round, the summarizer compresses the long trace into a structured set of sub-lemmas (Section 3, “Summarizing Exploration for Memory Maintenance”).
- The motivation is that raw traces contain redundant trial-and-error; subsequent rounds mostly benefit from the validated intermediate claims, not the full verbose exploration.
- These lemmas are stored in a lemma library (memory system) and then fed back into the next reasoning round alongside the original question (Figure 2).
- This creates a cumulative “scratchpad” effect: instead of re-deriving the same intermediate results, later rounds can start from previously established lemmas.
System/data pipeline diagram in words (explicit flow): 1. Round 1 starts with the Question alone as input. 2. The Reasoner agent generates a long-chain attempt (possibly partial). 3. The Summarizer agent reads the reasoning trace and outputs a list of candidate lemmas to store. 4. The Lemma verifier checks each lemma (via parallel sampling) and assigns a confidence score; only sufficiently trusted lemmas are added to memory (Section 3, “Verifying Theorems…”). 5. Round 2..(n−1) repeat steps 2–4, but now the input to the reasoner is Question + Lemma Library (Figure 2). 6. In the final round, the reasoner outputs a full solution draft. 7. The Process verifier (OPV) evaluates the final proof and produces natural-language feedback identifying flawed steps; the agent enters a revision loop to fix issues until verification passes or the loop limit is reached (Figure 2; Section 3, “Verifying Process for Final Proof Completion”).
3.4.2 Verification design: lemma-level and process-level (Section 3)¶
- Lemma verification to reduce error propagation
- The paper highlights a failure mode where an incorrect intermediate conclusion can mislead future reasoning (“circular reasoning or invalid proofs”), so it adds verification at the lemma level because checking a lemma is framed as easier than checking the entire solution (Section 3).
- Mechanism: for each lemma, the theorem verifier runs
nparallel verifications and uses the proportion of “correctly identified” outcomes as a confidence score (Section 3). -
The paper’s aim is to reduce false positives/negatives by voting-style aggregation, though the excerpt does not specify the exact verifier model architecture or thresholding policy used to accept/reject a lemma.
-
Process verification for final proof completion
- For end-to-end correctness and training feedback, the paper uses the
OPVprocess verifier, which provides step-level feedback in natural language (Section 3; Section 4.1 reward discussion). - The paper cites
OPVas having F1 > 85% on ProcessBench (Section 3), and uses it in two roles:- Test-time scaling: aggregate verification across multiple runs for robustness.
- Training signal / revision guidance: provide feedback for iterative revision and RL.
3.4.3 Training formulation: Hierarchical MDP + token-level policy gradient (Section 4.1; Eq. (1)–(3))¶
- The paper models the overall agent as a Hierarchical Markov Decision Process (MDP):
- State
s_tincludes problem context, reasoning trace, and verification feedback (Section 4.1). - High-level “meta-actions”
u_tinclude things like “extract lemmas,” “invoke verification,” and “commit answer” (Section 4.1). - Low-level actions are the generated token sequences
v_t = (v_{t,1}, …, v_{t,T_t})from the language model policyπ^L_θ(·|s_t)(Section 4.1). - The optimization objective is to maximize expected final reward
R(Eq. (1)): J(θ, φ) = E_{π^H_φ, π^L_θ}[R].- Here,
π^H_φdenotes the high-level policy (not fully detailed in the excerpt), andπ^L_θis the low-level token policy. - The advantage is computed using a value function
V(s_t)updated by temporal difference (Eq. (2)): V(s_t) ← E[r_t + γ V(s_{t+1})].- The token-level gradient aggregates log-likelihood gradients within each round weighted by the round advantage (Eq. (3)):
∇_θ J = E[ Σ_t A_t · Σ_τ ∇_θ log π^L_θ(v_{t,τ} | s_t, v_{t,<τ}) ].
Important constraint from the excerpt: the paper describes both a high-level and low-level policy, but the provided text does not give implementational details for π^H_φ (parameterization, action encoding, or how it interacts with the multi-agent modules), so any deeper description would be speculative.
3.4.4 Cold start: behavioral cloning on successful structured trajectories (Section 4.2; Eq. (4))¶
- Before online RL, the policy is initialized via behavioral cloning on filtered transitions where the output yields a well-formed lemma summary (syntactically valid, non-empty, logically segmented) (Section 4.2).
- The pretraining loss is standard token negative log-likelihood over these filtered trajectories (Eq. (4)):
L_RFT(θ) = - E_{(s_t, v_t) ~ D_init} [ Σ_τ log π^L_θ(v_{t,τ} | s_t, v_{t,<τ}) ].- The dataset
D_initis continuously augmented with outcome-filtered QA pairs “without previous thinking,” and the paper claims this improves discovery of positive trajectories during RL (Section 4.2).
3.4.5 OREAL-H: RL adaptations for multi-round agents (Section 4.3; Eq. (5)–(8), Algorithm 1)¶
The paper starts from OREAL (Outcome Reward RL) and introduces two key modifications to work with: (i) delayed credit assignment across rounds, and (ii) noisy continuous process-verifier rewards.
(A) Progress-conditioned advantage via lemma dependency graphs (Section 4.3.1; Eq. (5)–(6))
- Core idea: build a lemma dependency graph by aggregating lemmas produced across multiple rollouts for the same problem, then propagate success value backward from lemmas that lead to correct solutions.
- The paper defines a lemma value recursively (Eq. (5)):
- Plain language paraphrase: a lemma is valuable if it tends to lead to successor lemmas that eventually lead to a successful final solution.
- Notation: v(l) = E_{l' ∈ Succ(l)}[ v(l') ], where Succ(l) are lemmas derived from l.
- For a round t producing candidate lemma set L_t, the state value is taken optimistically as:
- V(s_t) = max_{l ∈ L_t} v(l) (Section 4.3.1).
- The advantage becomes a temporal-difference style improvement in best reachable lemma value (Eq. (6)):
- A_t = r_t + γ max_{l' ∈ L_{t+1}} v(l') − max_{l ∈ L_t} v(l).
- Rounds that produce no new lemmas are masked (advantage set to 0), so training focuses updates on rounds that yield “verifiable advances” (Section 4.3.1; Algorithm 1 lines 16–23).
- The paper claims this functions like a cheaper surrogate to MCTS-style value estimation, without extensive search overhead (Section 4.3.1; Figure 3 shows an example lemma graph).
(B) Conjugate reward modeling for noisy process verification (Section 4.3.2; Eq. (7)–(8))
- The process verifier (PV) is stochastic: passing k out of n checks is not treated as a clean probability of correctness (Section 4.3.2).
- The paper models latent solution quality p ∈ [0,1] with a Beta prior Beta(1,1) and updates after observing k successes out of n PV trials:
- Posterior: p | (k,n) ~ Beta(k+1, n-k+1) (Eq. (7)).
- Instead of using k/n directly, it defines reward as the probability the solution’s latent quality exceeds a “completely invalid” baseline that fails all checks:
- p1 ~ Beta(k+1, n-k+1) (current solution), p0 ~ Beta(1, n+1) (baseline),
- reward R(k,n) = P(p1 > p0) (Eq. (8)).
- In practice, the paper fixes n = 4 PV trials (Section 4.3.2).
- It reports R(4,4) ≈ 5.5 and baseline R(0,4) = 0, describing this as “99.5% dominance probability over the baseline,” with smoothly interpolated rewards for k = 1,2,3 (Section 4.3.2).
- Note: the excerpt’s numeric mapping (e.g., reward ≈ 5.5) implies a scaled reward rather than a pure probability in [0,1]; the text mixes “probability” language with a >1 numeric reward. The provided excerpt does not fully clarify this scaling, so the safest interpretation is: they compute a dominance probability and then use a scaled form as the RL reward.
(C) End-to-end OREAL-H loop (Algorithm 1)
1. Sample K multi-round trajectories per problem with the current policy.
2. Run PV to get k/n outcomes and compute conjugate reward.
3. Build lemma dependency graph across trajectories.
4. Backpropagate lemma values and compute progress-conditioned advantages.
5. Optimize the policy with an OREAL-style loss (Algorithm 1 line 27; details of the exact loss are referenced to [18] but not shown in the excerpt).
3.4.6 Budgets, inference scaling, and hyperparameters (Figure 1; Appendix B; Section 5.4)¶
Inference / test-time scaling budgets
- Figure 1(b) and the Introduction describe that Intern-S1-MO can use about 512k tokens to solve a single problem, motivated as ~8× extension over a 64k context constraint (Introduction; Conclusion).
- Default evaluation budgets (Appendix B.1):
- Max 8 inference rounds for reasoner and summarizer.
- Theorem verifier: 4 parallel verifications per lemma.
- Final iterative revision (process verifier loop): max 8 rounds.
- Max length of output for reasoner and summarizer: 64k tokens per output.
Training hyperparameters (Appendix B.2)
- Batch: 64 questions per batch.
- Rollouts: 16 rollouts per question.
- Max length per rollout trajectory: 65,536 tokens.
- Filtering: discard questions with overall pass rate 0 or 1.
- Optimizer: AdamW.
- Learning rate: 5e−7, cosine annealing schedule decaying to 1/5 of initial LR.
- KL coefficient β: 0.01.
Official CMO2025 participation budget (Section 5.4) - For each problem: - 256-shot parallel search over up to 12 rounds. - Intermediate lemmas: lemma verifier provides multiple rounds of 8-shot feedback. - Candidate solution refinement: 8-shot refinement procedure comprising 24 rounds, where OPV flags issues and the policy revises accordingly.
Model architecture details (missing in provided excerpt) - The excerpt does not specify core LRM architecture/configuration such as: - number of layers, hidden size, attention heads, tokenizer, context window during pretraining, hardware, total training tokens, or compute budget. - Because the user asked for a deep reading grounded only in provided text, these details cannot be supplied without guessing.
4. Key Insights and Innovations¶
- (1) Lemma-based compact memory enabling multi-round reasoning (Figure 2; Section 3)
- Novelty relative to single-pass CoT: instead of relying on one extended context, the system repeatedly compresses reasoning into reusable lemmas and re-injects them as memory in later rounds.
-
Significance: enables “unlimited exploration capability” in principle by keeping only essential validated statements, letting the agent explore lemma-rich solution spaces beyond a single context limit.
-
(2) Explicit separation of roles into reasoner / summarizer / verifier agents (Figure 2)
- Unlike prompt-only self-reflection, the paper operationalizes different cognitive tasks (explore vs compress vs check) as separate agent steps.
-
This is positioned as a systematic structure rather than ad-hoc prompting, and ablations suggest each part contributes (Table 2).
-
(3) Lemma verification via parallel sampling confidence scores (Section 3)
- The paper treats lemma verification as easier than full-proof verification, using multiple verifier runs and a confidence score to reduce error propagation.
-
This explicitly targets a known multi-step reasoning failure mode: compounding errors when later steps rely on flawed intermediate claims.
-
(4)
OREAL-H: RL credit assignment using lemma dependency graphs (Section 4.3.1; Eq. (5)–(6); Figure 3) - Novelty: assigns learning signal to rounds that make “verifiable progress” by estimating lemma values in a graph aggregated over rollouts, rather than relying only on final correctness.
-
Significance: the paper frames this as reducing variance and avoiding wasted updates on unproductive long trajectories.
-
(5) Conjugate reward model for noisy process verification (Section 4.3.2; Eq. (7)–(8))
- Novelty: replaces raw
k/nPV pass rates with a Bayesian dominance probability against an invalid baseline. - Significance: intended to stabilize learning when the verifier is stochastic and imperfect, preserving strong gradients for high-confidence passes while suppressing “lucky” passes.
5. Experimental Analysis¶
Evaluation methodology (Section 5.1; Appendix D)¶
- Benchmarks used
AIME2025(answer-focused)HMMT2025 Feb(answer-focused)IMO2025(non-geometry only)CNMO2025(non-geometry only)-
CMO2025(official participation, human-graded) -
Metrics
- For
AIME2025,HMMT2025,CNMO2025: pass@1 (with 16 independent rollouts and “unbiased pass@1” per [5]) (Section 5.1). - For
IMO2025: pass@4 (Section 5.1; Table 1). -
For proof-oriented datasets (
CNMO2025,IMO2025): an LLM-based fine-grained grading scheme inspired by MathArena, with modifications to better bind sub-propositions to proof obligations (Appendix D).- Ensemble evaluation: each solution is graded across N = 8 independent runs; final score is the mean across runs (Appendix D).
-
Baselines (Section 5.1; Table 1)
Gemini2.5-pro,o3-high,Grok4,GPT-OSS-120B,DeepSeek-R1-0528,Qwen3-235B-A22B.- For some baseline scores on AIME/HMMT, the paper reuses numbers from technical reports / MathArena (Table 1 caption; Section 5.1).
Main quantitative results (Table 1; Table 3)¶
Table 1 (overall benchmark comparison)
- Intern-S1-MO:
- HMMT2025: 95.0
- AIME2025: 96.6
- CNMO2025: 232.4
- IMO2025 (non-geometry, pass@4): 26
- Intern-S1-mini-MO:
- HMMT2025: 79.2
- AIME2025: 87.3
- CNMO2025: 176.3
- IMO2025: 17
- Best baseline numbers shown in Table 1 include:
- HMMT2025: baseline best is Grok4 at 92.5, while Intern-S1-MO is 95.0.
- AIME2025: baseline best is GPT-OSS-120B at 92.5 (Table 1), while Intern-S1-MO is 96.6.
- CNMO2025: baseline best is Gemini2.5-pro at 157.5, while Intern-S1-MO is 232.4.
- IMO2025: baselines shown range up to 14 (Gemini2.5-pro and Qwen3-235B-A22B), while Intern-S1-MO is 26.
Table 3 (official CMO2025 participation)
- Total: 102 / 126
- Per problem: P1=21, P2=21, P3=9, P4=21, P5=21, P6=9.
- The paper states this exceeds a gold threshold of 78 points (Section 5.4).
Ablation studies (Table 2)¶
- The ablation isolates additive contributions on
HMMT2025,AIME2025,CNMO2025: Single-round with Agents: 70.8 / 81.9 / 178.0+ Multi-round Reasoning: 85.4 / 91.0 / 201.7+ Theorem Verifier: 86.3 / 93.3 / 203.0+ Process Verifier: 89.1 / 94.0 / 215.2+ OReal-H: 95.0 / 96.6 / 232.4- Interpretation grounded in Table 2:
- The largest single jump comes from adding multi-round reasoning (e.g., CNMO2025: 178.0 → 201.7).
- Verification components add incremental improvements.
OREAL-Hadds a substantial final gain (e.g., CNMO2025: 215.2 → 232.4).
Do the experiments support the claims?¶
- The provided results convincingly support the narrower claim that:
- among the models compared in Table 1,
Intern-S1-MOachieves the best reported scores on these benchmarks, and - each architectural/training component contributes in the direction expected (Table 2).
- Two caveats that remain based on the excerpt:
- For some baselines on AIME/HMMT, scores are imported from other reports/MathArena rather than re-run under identical conditions (Table 1 caption), which can complicate strict apples-to-apples comparisons.
- The excerpt does not provide confidence intervals or statistical tests; variability is mentioned for IMO2025 due to only 5 problems (Section 5.3).
6. Limitations and Trade-offs¶
- High inference cost / test-time scaling
- The system is explicitly designed to scale test-time compute (Figure 1; Appendix B.1; Section 5.4), including up to 512k tokens per problem and large parallel search budgets (e.g., 256-shot in CMO2025).
-
This is a trade-off: performance gains may depend on substantial inference-time resources.
-
Verifier dependence and noise
- The approach relies on theorem verification and process verification to prevent error propagation (Section 3).
-
Even with conjugate reward denoising, PV feedback is acknowledged as noisy (Section 4.3.2), and verifier errors could still steer search/training.
-
Missing architectural reproducibility details (from the provided excerpt)
-
The excerpt does not include core model architecture parameters (layers, dimensions, tokenizer, pretraining compute/tokens, hardware), limiting full reproducibility from this text alone.
-
Scope restrictions in evaluation
- For
IMO2025andCNMO2025, the evaluation explicitly excludes geometry problems (Section 5.1). -
Therefore, conclusions in the excerpt primarily support non-geometry Olympiad problem solving.
-
Potential memory contamination / lemma quality management
- The paper’s own motivation emphasizes error propagation; despite lemma verification, storing incorrect or ambiguous lemmas could still degrade later rounds.
- The excerpt does not specify acceptance thresholds or memory pruning strategies beyond verification/voting, so it is unclear how robust the lemma library is under adversarial or highly ambiguous intermediate claims.
7. Implications and Future Directions¶
- How this changes the landscape
- The work suggests that pushing beyond context limits for Olympiad math may be less about ever-longer single-pass contexts and more about agentic decomposition + durable, verified intermediate memory (Figure 2; Conclusion).
-
It also frames process verification not only as an evaluator but as a training signal and revision guide, integrating verification into the reasoning loop (Section 3; Section 4).
-
Follow-up research enabled
- Better lemma memory management: selection, deduplication, contradiction detection, and retrieval policies are natural next steps, since lemma quality is central.
- Stronger graph-based credit assignment: the lemma dependency graph idea (Figure 3) could be expanded with richer edge definitions or more principled value backups, but the excerpt does not detail these choices.
-
Improving verifier calibration: since PV noise is important enough to motivate conjugate rewards, improved calibration or joint training of verifier + policy could further stabilize learning.
-
Practical applications / downstream use
- A system like this can be applied where solutions require long proofs with intermediate milestones: competition math, proof drafting, and iterative refinement workflows (Conclusion’s “math research” motivation).
-
The paper’s official competition participation (CMO2025) indicates an intended use case: real-world timed settings with human-style proof submissions (Section 5.4).
-
Repro/Integration Guidance (based on the provided excerpt)
- Prefer this agentic approach over single-pass prompting when:
- the problem requires building many intermediate claims and revisiting them across multiple attempts (IMO-style), and
- you can afford multi-round inference budgets and verification overhead (Appendix B.1; Section 5.4).
- If you only need final-answer tasks (AIME/HMMT style), the gains may be smaller relative to strong baselines (paper’s own discussion in Section 5.2), so the added complexity might not be cost-effective.
- Training-wise, the excerpt indicates a staged recipe:
- generate trajectories (variant of
Intern-S1) and filter for structured lemma summaries (Section 5.1; Section 4.2), - behavioral cloning cold start (Eq. (4)),
- online RL with PV-based rewards + lemma-graph advantages (
OREAL-H) (Section 4.3; Algorithm 1), using reported hyperparameters (Appendix B.2: AdamW, LR 5e−7, cosine schedule, β=0.01, batch=64 questions, 16 rollouts/question, max 65,536 tokens/rollout).
- generate trajectories (variant of