Skip to content

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

ArXiv: 2511.22570

🎯 Pitch

The paper introduces a rubric-driven LLM verifier (plus a meta-verifier) and uses it as a learned reward to train a proof generator that self-critiques and iteratively repairs its own proofs, closing the generation–verification gap. This self-verification loop enables scaling test-time verification without reference solutions, producing much more faithful formal reasoning and delivering gold-level performance on hard contests (e.g., IMO 2025, Putnam 2024), which demonstrates practical steps toward trustworthy, research-grade mathematical AI.


1. Executive Summary (2-3 sentences)

DeepSeekMath-V2 targets a central failure mode in math-focused LLMs: getting the right final answer (or a plausible-looking proof) does not ensure the reasoning is valid, and many theorem-proving tasks do not even have a checkable final numeric answer. The paper builds a trainable, rubric-driven proof verifier (plus a meta-verifier that checks whether the verifier’s criticisms are real) and then uses that verifier as a reward model to train a proof generator that explicitly self-critiques and iteratively fixes its own proofs (Sections 2.1–2.3, Eqs. (1)–(6)). With scaled test-time generation+verification compute, it reports gold-level contest performance (e.g., 5/6 solved on IMO 2025; 118/120 on Putnam 2024) (Section 3.3.3, Table 1).

2. Context and Motivation

  • Problem/gap addressed
  • Reinforcement learning (RL) for math reasoning often rewards final answer correctness on quantitative problems, which is easy to verify when a ground-truth answer exists (Introduction).
  • This creates two fundamental gaps (Introduction):

    • A correct final answer can be produced via flawed logic (“correct answers don’t guarantee correct reasoning”).
    • The reward is inapplicable to theorem proving where the goal is a rigorous step-by-step proof, often without a single numeric answer.
  • Why it matters

  • Theorem proving and proof-like reasoning require faithful derivations; unreliable proofs block use in high-stakes scientific/math settings (Introduction).
  • Self-verification is framed as especially important when scaling test-time compute on open problems with unknown solutions: you need a way to increase confidence without reference answers (Introduction).

  • Prior approaches and shortcomings (as positioned in this paper)

  • Final-answer-based RL scales quantitative benchmarks but does not teach models to reliably verify proofs; models show “high false-positive rates” for invalid natural-language proofs (Introduction).
  • Formal proof assistants (e.g., Lean/Isabelle) can guarantee correctness but require formalization; informal proofs are historically “hard to verify automatically” (Related Work).

  • How this paper positions itself

  • The paper proposes a practical route to create and maintain a generation–verification gap in natural language theorem proving by:
    • training a strong verifier,
    • using it as a reward model for proof generation,
    • and continuously improving the verifier using automatically-labeled hard cases created by scaled verification compute (Introduction; Sections 2.1–2.3).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a pair of LLM roles—verifier and generator—trained so the model can both write proofs and critique proofs in a structured, rubric-based way.
  • It solves theorem-proving-without-reference-answers by replacing “final answer reward” with “proof-quality reward,” and by using scaled verification and iterative refinement at test time (Sections 2 and 3.3.2–3.3.3).

3.2 Big-picture architecture (diagram in words)

  • Inputs: a math problem X and a candidate proof Y.
  • Verifier π_φ: reads (X, Y, I_v) and outputs (i) an issues summary and (ii) a discrete proof score in {0, 0.5, 1} (Section 2.1.1).
  • Meta-verifier π_η: reads (X, Y, V, I_mv) where V is the verifier’s analysis, and scores whether the verifier’s cited issues are real/justified (Section 2.1.2).
  • Generator π_θ: reads a problem X and outputs a proof Y plus a self-analysis Z; training rewards combine proof quality and the faithfulness of the self-analysis (Section 2.2.2).
  • Auto-labeling loop: scaled sampling of verification + meta-verification produces labeled training data to further improve the verifier as the generator improves (Section 2.3).

3.3 Roadmap for the deep dive

  • Explain the proof scoring rubric and verifier output format because everything downstream depends on what “quality” means (Section 2.1.1; Appendices A.2).
  • Detail verifier RL training and why naive score-only training can hallucinate issues (Eqs. (1)–(2); Section 2.1.2).
  • Introduce meta-verification and how it modifies verifier rewards to improve faithfulness (Eq. (3); Section 2.1.2).
  • Describe generator RL training first with verifier-as-reward, then with self-verification reward shaping that incentivizes honest self-critique (Eqs. (4)–(6); Section 2.2).
  • Close with the synergistic data flywheel and the scaled test-time search procedure (Section 2.3; Section 3.3.3).

3.4 Detailed, sentence-based technical breakdown

  • Framing: This is an RL-and-systems style paper that builds an iterative training loop for natural-language theorem proving by making proof verification itself a trainable capability and then using it as the reward signal for proof generation (Introduction; Section 2).

  • Core objects and rubrics

  • The paper defines a proof evaluation rubric I_v (referenced as Appendix A.2) and trains a verifier π_φ(· | X, Y, I_v) that produces (i) a written analysis listing issues and (ii) a three-level score:
    • 1 = complete, rigorous proof with all steps justified,
    • 0.5 = overall sound but minor errors/omissions,
    • 0 = fatal logical errors or critical gaps (Section 2.1.1).
  • The verifier is required to emit a specific output format containing the phrase Here is my evaluation of the solution: and a final score inside \boxed{} after Based on my evaluation, the final overall score should be: (Section 2.1.1; Appendix A.2 shows the template).

  • Cold-start data construction for verifier training

  • The paper crawls proof-required contest problems from AoPS, prioritizing olympiads/team selection tests/post-2010, totaling 17,503 problems; this problem set is named D_p (Section 2.1.1).
  • It generates candidate proofs using a “variant of DeepSeek-V3.2-Exp-Thinking,” and prompts iterative refinement over multiple rounds because the base model tends to be concise and error-prone (Section 2.1.1).
  • Mathematical experts then score sampled proofs across problem types, yielding an initial labeled dataset D_v = {(X_i, Y_i, s_i)} with s_i ∈ {0, 0.5, 1} (Section 2.1.1).

  • Verifier RL objective (score prediction + format enforcement)

  • The verifier is trained (starting from a DeepSeek-V3.2-Exp-SFT checkpoint fine-tuned on math/code reasoning) with RL to output analyses whose extracted score s'_i matches the expert label s_i (Section 2.1.1).
  • The reward has two factors (Section 2.1.1):
    • R_format: an indicator that checks the required textual markers and score boxing appear.
    • R_score(s'_i, s_i) = 1 − |s'_i − s_i| (Eq. (1)), which gives full reward at exact match and linearly penalizes mismatch.
  • The RL objective maximizes the expected product R_format(V'_i) · R_score(s'_i, s_i) over samples from D_v (Eq. (2)), where V'_i is the verifier’s full response and s'_i is parsed from it.

  • Why score-only training is vulnerable: hallucinated issues

  • A key failure mode is that a verifier can get the score right while inventing nonexistent issues, because the training signal in Eqs. (1)–(2) only directly supervises the final score, not the correctness of the issue list (Section 2.1.2).
  • This undermines trust: if the verifier claims there are issues when there aren’t, the generator may “fix” the wrong things, or users may lose confidence in verification.

  • Meta-verification: verifying the verifier

  • The paper introduces meta-verification, governed by rubric I_mv (Appendix A.3), which evaluates whether the verifier’s identified issues truly exist and justify the assigned score (Section 2.1.2).
  • Data: experts score verifier responses, forming D_mv = {(X_i, Y_i, V_i, ms_i)} with ms_i ∈ {0, 0.5, 1}, where V_i is the verifier’s analysis and ms_i is the quality score of that analysis (Section 2.1.2).
  • Training: a meta-verifier π_η(· | X, Y, V, I_mv) is trained with the same style of format+score RL objective as the verifier (Section 2.1.2).
  • Integration: verifier training reward is augmented to
    R_V = R_format · R_score · R_meta (Eq. (3)), where R_meta is the meta-verifier’s quality score for the verifier’s analysis.
  • Reported internal validation: on a validation split of D_v, the meta-verifier’s average quality score for verifier analyses improves from 0.85 to 0.96 while maintaining the same proof-score prediction accuracy (Section 2.1.2).

    • This is important because it is the paper’s direct evidence that meta-verification improves faithfulness of issue identification, not just score calibration.
  • Training the proof generator using the verifier as a reward model

  • With the verifier π_φ acting as a “generative reward model,” the generator π_θ(·|X) is trained to maximize expected verifier score (Section 2.2.1).
  • The objective is
    max_{π_θ} E_{X_i∼D_p, Y_i∼π_θ(·|X_i)} [R_Y] (Eq. (4)),
    where R_Y is the score output by the verifier when evaluating (X_i, Y_i, I_v).

  • Self-verification training: forcing the generator to be honest about its own proofs

  • The paper observes a gap: a generator can improve when an external verifier critiques it, but when asked to critique itself “in one shot,” it often falsely claims correctness even when the external verifier finds flaws (Section 2.2.2).
  • To fix this, training forces the generator to output both:
    • a proof Y, and
    • a self-analysis Z that follows the same rubric/format as the verifier (Appendix A.1; Section 2.2.2).
  • The verifier then evaluates both parts (Section 2.2.2):
    • Proof reward R_Y = s (the verifier’s score for Y).
    • Self-analysis reward R_Z, which combines:
    • R_score(s', s) where s' is the generator’s self-assigned score extracted from Z, and
    • R_meta(Z) = ms, a meta-verification score for whether the self-analysis is faithful/justified (Eq. (6)).
  • The combined reward is
    R = R_format(Y, Z) · (α·R_Y + β·R_Z) (Eq. (5)), with α = 0.76 and β = 0.24 (Section 2.2.2).
  • The incentive structure is explicitly spelled out (Section 2.2.2):

    • Admitting errors is rewarded over falsely claiming perfection.
    • Highest reward requires both a correct proof and accurate recognition of rigor.
    • A rational strategy is to find and fix as many issues as possible before finalizing.
  • Synergistic loop + fully automated labeling via scaled verification

  • As the generator improves, it produces harder-to-verify proofs; the verifier must improve to keep the “generation–verification gap” (Section 2.3).
  • Manual proof labeling becomes harder as mistakes become subtle, so the paper scales verification compute to create labels (Section 2.3):
    1. For each proof, sample n independent verification analyses.
    2. For analyses scoring 0 or 0.5, sample m meta-verification assessments to validate that the issues are real; accept an analysis if a majority confirm it.
    3. Look at the lowest-score analyses; if at least k such analyses are validated, label the proof with that lowest score; if no legitimate issues are found across attempts, label as 1; otherwise discard or route to humans (Section 2.3).
  • The paper reports that in the last two training iterations, this pipeline “replaced human annotation entirely,” and quality checks showed automated labels aligned with experts (Section 2.3).

    • Note: the excerpt does not provide the actual numeric values of n, m, or k.
  • Training procedure across iterations

  • RL algorithm: GRPO (Group Relative Policy Optimization) is used for RL training (Section 3.1).
  • Iterative schedule (Section 3.1):

    • Each iteration: (i) optimize proof verification, then (ii) initialize the generator from the verifier checkpoint and optimize proof generation.
    • From iteration 2 onward: initialize the verifier from a checkpoint that consolidates verification+generation capabilities from the previous iteration via rejection fine-tuning (Section 3.1).
  • Worked micro-example (to make the mechanism concrete)

  • Suppose a generator outputs a proof Y and self-analysis Z with self-score s' = 1.
  • If the verifier judges the proof is only partially correct, s = 0.5, then:
    • The self-score accuracy term is R_score(s', s) = 1 − |1 − 0.5| = 0.5 (Eq. (1)).
    • If the meta-verifier finds the self-analysis is unfaithful (e.g., it failed to acknowledge the real gap), R_meta(Z) would be low (in {0, 0.5, 1} per Section 2.1.2’s scheme), and thus R_Z becomes small (Eq. (6)).
  • This makes “claiming perfection when imperfect” systematically worse than “admitting issues and matching the verifier’s score,” which is the behavior the paper seeks (Section 2.2.2).

  • Missing implementation details (explicitly not provided in the excerpt)

  • The paper states the model is “built on DeepSeek-V3.2-Exp-Base” (Introduction), but the provided content does not include:
    • model architecture specifics (layers, hidden size, heads),
    • tokenizer details,
    • optimizer hyperparameters / learning-rate schedule / batch size,
    • total training tokens, hardware, or compute budget.
  • Because your instruction requires grounding in the provided paper, I cannot supply those details without speculation.

4. Key Insights and Innovations

  • (1) Rubric-based proof verification trained with RL, not answer matching
  • Novelty: replaces “final answer correctness” with an explicit, discrete proof-quality signal {0, 0.5, 1} tied to rigor/completeness (Section 2.1.1).
  • Significance: makes theorem-proving tasks rewardable even when no numeric answer exists (Introduction; Section 2).

  • (2) Meta-verification to reduce hallucinated critiques

  • Novelty: trains a second model to judge whether the verifier’s identified issues are real and sufficient to justify the score, then multiplies this into the verifier’s reward (Section 2.1.2; Eq. (3)).
  • Significance: directly targets faithfulness of verification, not just score calibration; the paper reports an increase in meta-evaluated analysis quality from 0.85 to 0.96 (Section 2.1.2).

  • (3) Self-verification reward shaping for the generator (proof + self-analysis)

  • Novelty: trains the generator to output both the proof and a verifier-style critique, and rewards it for (i) proof quality and (ii) accuracy/faithfulness of its self-critique (Section 2.2.2; Eqs. (5)–(6)).
  • Significance: addresses the observed failure where generators overclaim correctness when self-checking “in one shot” (Section 2.2.2).

  • (4) Compute-scaled automatic labeling to keep pace with a stronger generator

  • Novelty: uses multi-sample verification plus meta-verification majority votes to label new, hard-to-verify proofs, enabling verifier improvement without human annotation in later iterations (Section 2.3).
  • Significance: proposes a mechanism for sustained improvement when manual labeling becomes the bottleneck (Section 2.3).

  • (5) Test-time scaled search guided by verification

  • Novelty: a concrete pool-based refinement procedure that repeatedly selects high-scoring proofs and refines them using issue-focused analyses until proofs pass many independent verification attempts (Section 3.3.3).
  • Significance: ties self-verification directly to scaling test-time compute for harder problems (Introduction; Section 3.3.3).

5. Experimental Analysis

Evaluation methodology (datasets, metrics, setup)

  • Benchmarks (Section 3.2)
  • In-house CNML-level set: 91 theorem-proving problems across categories:
    • algebra (13), geometry (24), number theory (19), combinatorics (24), inequality (11).
  • Competition/problem sets:

    • IMO 2025 (6), CMO 2024 (6), Putnam 2024 (12), ISL 2024 (31),
    • IMO-ProofBench (60), split into Basic (30) and Advanced (30) (Section 3.2).
  • Primary metric used throughout

  • A discrete proof score in {0, 0.5, 1} produced by the paper’s trained verifier (Section 2.1.1).
  • For some evaluations, correctness/score is based on majority voting across multiple verifier analyses (Section 3.3.1 and 3.3.2).

  • One-shot generation evaluation (Section 3.3.1)

  • For the in-house set:
    • Generate 8 proof samples per problem for each model.
    • Measure correctness via majority voting across 8 verification analyses produced by the final verifier (Section 3.3.1).
  • Result summary (Figure 1): DeepSeekMath-V2 shows higher mean proof scores than GPT-5-Thinking-High and Gemini 2.5-Pro across all five categories.

    • From Figure 1 (mean proof score by category):
    • Algebra: Gemini 0.35, GPT-5 0.54, DeepSeekMath-V2 0.60.
    • Geometry: Gemini 0.17, GPT-5 0.15, DeepSeekMath-V2 0.52.
    • Number theory: Gemini 0.32, GPT-5 0.36, DeepSeekMath-V2 0.54.
    • Combinatorics: Gemini 0.32, GPT-5 0.35, DeepSeekMath-V2 0.47.
    • Inequality: Gemini 0.45, GPT-5 0.38, DeepSeekMath-V2 0.59.
    • These are verifier-evaluated scores; the excerpt does not describe independent human grading for this in-house benchmark.
  • Sequential refinement with self-verification (Section 3.3.2)

  • Setup on IMO Shortlist 2024:
    • Launch 32 independent refinement threads per problem.
    • Vary maximum sequential iterations from 1 to 8 (where 1 = no refinement; 8 = initial + up to 7 refinements) (Figure 2 caption).
    • Measure final proof correctness by majority voting across 32 verifier analyses (Section 3.3.2).
  • Metrics reported (Section 3.3.2; Figure 2):
    • Pass@1: average score of the final proof from each thread.
    • Best@32: score of best proof per problem selected using the generator’s self-assigned scores.
  • Quantitative trend (Figure 2):
    • Pass@1 rises from 0.15 (1 iter) to 0.27 (8 iters).
    • Best@32 rises from 0.26 (1 iter) to 0.42 (8 iters).
  • Interpretation grounded in the paper’s claim: the gap between Best@32 and Pass@1 is used as evidence that the generator’s self-scores help pick better proofs, i.e., it can “accurately assess proof quality” (Section 3.3.2).

  • High-compute search (scaled generation + verification) (Section 3.3.3)

  • Procedure (Section 3.3.3):
    • Maintain a pool of candidate proofs per problem.
    • Initialize with 64 proof samples, each with 64 verification analyses.
    • Iteratively:
    • Select the 64 highest-scoring proofs by average verification score.
    • Pair each proof with 8 randomly selected analyses, prioritizing issue-finding ones (scores 0 or 0.5).
    • Generate one refined proof per proof-analysis pair to update the pool.
    • Run up to 16 iterations or stop early if a proof “passes all 64 verification attempts.”
    • Uses a single model (the final generator) to do both generation and verification at test time (Section 3.3.3).
  • Human validation: mathematical experts assess the highest-scoring proofs (Section 3.3.3).
  • Results (Table 1; Section 3.3.3):
    • IMO 2025: solves P1–P5 fully (with P6 not fully solved), total 83.3% points.
    • CMO 2024: fully solves P1, P2, P4, P5, P6 and gets partial on P3, total 73.8% points.
    • Putnam 2024: fully solves A1–B4, B5, B6 with one remaining problem having “minor errors,” total 98.3% points and 118/120 (Section 3.3.3).
  • Additional benchmark: IMO-ProofBench (Figure 3; Section 3.3.3).
    • Figure 3 reports expert evaluation percentages on Basic/Advanced subsets.
    • DeepSeekMath-V2 (Heavy): 99.0 (Basic) and 61.9 (Advanced).
    • For comparison values shown in Figure 3 include (Basic / Advanced):
    • Gemini Deep Think (IMO Gold): 89.0 / 65.7
    • Gemini Deep Think (IMO lite): 83.8 / 37.6
    • GPT-5: 59.0 / 20.0
    • Gemini 2.5 Pro: 55.2 / 17.6
    • The text notes all results are sourced from Luong et al. (2025) except DeepSeekMath-V2, which is evaluated by their experts following the grading guidelines (Figure 3 caption).

Do the experiments support the claims?

  • The experiments directly support the paper’s core mechanism claims:
  • Self-verification improves results with more compute: Figure 2 shows monotonic improvements in both Pass@1 and Best@32 as sequential iterations increase (Section 3.3.2).
  • Verifier-guided search can solve very hard problems: Table 1 + expert assessment provides evidence that scaled verification+generation can reach gold-level contest scores (Section 3.3.3).
  • However, some evidence is internally defined:
  • Many metrics rely on the paper’s own verifier for scoring (Sections 3.3.1–3.3.2), which is appropriate for scaling but introduces a dependency: if the verifier has blind spots, the reported mean scores could be optimistic. The paper partially addresses this by using expert grading for key headline results (Section 3.3.3; Figure 3 for ProofBench uses expert evaluation).

Ablations / robustness / failure cases

  • In the provided excerpt, there are no explicit ablation tables (e.g., removing meta-verification, changing α/β, removing R_format) beyond:
  • the reported improvement of verifier analysis quality 0.85 → 0.96 with meta-verification (Section 2.1.2),
  • and the iteration/compute scaling curves in Figure 2.
  • The paper does note a practical failure mode: for IMO/CMO-level problems, one-shot generation may fail within the 128K token limit, and even when the generator recognizes invalidity, it may lack context length to resolve all issues in one attempt (Section 3.3.2).

6. Limitations and Trade-offs

  • Verifier dependence / self-referential evaluation
  • Many reported proof-quality scores (especially in-house CNML and sequential refinement curves) rely on the model’s own trained verifier plus voting (Sections 3.3.1–3.3.2). This creates a risk of shared blind spots between generator and verifier, particularly because the final system “uses a single model” for both roles at test time in high-compute search (Section 3.3.3).

  • Discrete scoring granularity

  • The score space {0, 0.5, 1} is coarse (Section 2.1.1). Coarse rewards can be easier to learn but may:

    • under-incentivize incremental rigor improvements within the same bucket,
    • or collapse distinct error types (minor omission vs subtle logical flaw) into the same 0.5.
  • Compute and complexity trade-off

  • The approach explicitly scales test-time compute using many samples (e.g., 64 proofs × 64 analyses, iterative pool refinement up to 16 rounds) (Section 3.3.3). This is powerful but expensive, and the excerpt does not quantify runtime/latency/cost.

  • Missing reproducibility-critical training details (in the provided content)

  • The excerpt does not specify core training hyperparameters (optimizer settings, LR schedule, batch sizes), model architecture, total tokens, or hardware. This limits the ability to reproduce results from the text provided.

  • Context-length constraints remain

  • The paper explicitly mentions the 128K token limit as a barrier to producing fully rigorous proofs for difficult problems in one attempt (Section 3.3.2), implying that even with self-verification, some problems may require multi-step external refinement.

  • Auto-labeling parameters unspecified

  • The automated labeling pipeline relies on n verifier samples, m meta-verifier samples, and a threshold k (Section 2.3), but the excerpt does not provide their numeric values. Without them, it is hard to assess label reliability vs compute cost.

7. Implications and Future Directions

  • How this changes the landscape
  • The paper provides a concrete recipe for scaling theorem-proving ability without ground-truth answers by making verification a first-class learnable skill and using it to drive both training and test-time search (Introduction; Sections 2–3).
  • It suggests that “self-verifiable reasoning” can be approached via an iterative generator–verifier co-evolution loop rather than relying solely on externally checkable answers (Introduction; Section 2.3).

  • Follow-up research enabled/suggested by the paper

  • Stronger faithfulness guarantees: meta-verification improves analysis quality (Section 2.1.2), but future work could explore richer rubrics or more fine-grained scoring beyond {0, 0.5, 1}.
  • Better calibration of “no issues found” as confidence: the paper operationalizes confidence as passing many independent verification attempts (Section 3.3.3). Future work could study how verification-sample-count correlates with true correctness under expert grading.
  • Reducing compute: since the best results use heavy sampling and multiple iterations (Sections 3.3.2–3.3.3), a natural direction is distilling high-compute search into lower-cost policies.

  • Practical applications / downstream use cases

  • Competition-style theorem proving in natural language (Sections 3.2–3.3).
  • More broadly, any setting where solutions are hard to verify by final answers alone, and where iterative critique-and-repair is feasible (Introduction).

  • Repro/Integration Guidance (based on what’s provided)

  • Prefer this method over final-answer RL when:
    • tasks require proofs or derivations rather than numeric outputs (Introduction),
    • reference solutions are absent or verification is otherwise hard, making self-verification valuable for scaling test-time compute (Introduction; Section 3.3.3).
  • The paper’s training decomposition is operationally clear even without full hyperparameters:
    • Train verifier on expert-scored proof data with format+score reward (Section 2.1.1).
    • Add meta-verifier and incorporate its score into verifier training to reduce hallucinated issues (Section 2.1.2; Eq. (3)).
    • Train generator with verifier reward (Section 2.2.1; Eq. (4)).
    • Train generator to output proof + self-analysis, rewarding both correctness and faithful self-assessment with weights α=0.76, β=0.24 (Section 2.2.2; Eqs. (5)–(6)).
    • Use scaled verification+generation to (i) solve hard problems and (ii) auto-label new training data to keep improving verification (Sections 2.3 and 3.3.3).