The Lessons of Developing Process Reward Models in Mathematical Reasoning

🎯 Pitch

The paper diagnoses how two common practices mislead the development of Process Reward Models (PRMs): Monte Carlo (MC) estimation for step-level labeling and Best-of‑N (BoN) evaluation. It shows that MC labels behave like value estimates and lead PRMs to tolerate outcome‑correct but process‑flawed solutions. It then introduces a simple consensus filtering pipeline that combines MC estimation with LLM-as-a-judge and reports substantially improved step‑error localization on PROCESSBENCH while preserving strong BoN selection performance, making PRMs more reliable for process‑aware verification and downstream selection/search.

1. Executive Summary

This paper analyzes why many existing Process Reward Models (PRMs) for math reasoning underperform at their intended job—verifying intermediate reasoning steps—despite looking strong under the common Best-of-N (BoN) selection evaluation. It shows that popular Monte Carlo (MC) estimation-based step labeling can be noisy and can push PRMs toward outcome-like behavior, and it proposes a simple consensus filtering pipeline combining MC labeling with LLM-as-a-judge step verification that improves step-error localization on PROCESSBENCH while maintaining strong BoN results (Sections 3–4; Figures 2–4; Tables 6–7).

2. Context and Motivation

Problem / gap addressed
Large Language Models (LLMs) can reach correct final answers while containing wrong or fabricated intermediate reasoning steps, which undermines trust in the reasoning process (Introduction, Section 1; Figure 6).
PRMs aim to score each step in a solution so that systems can detect where reasoning first goes wrong and use those signals for selection or search (Section 1; Section 2.2 “PROCESSBENCH”).
Why this matters
If an evaluation only checks final answers, models can be rewarded for “answer-correct but reasoning-wrong” behavior (Section 3.2.1; Figure 6).
In downstream usage (e.g., selecting among multiple sampled solutions), a PRM is supposed to favor solutions whose process is correct, not just the outcome (Sections 1, 3.2).
Prior approaches and where they fall short
Human step annotation can produce high-quality supervision but is expensive (Section 1; Section 3.1.2 uses PRM800K as the human-annotated reference).
MC estimation labels a step by sampling multiple completions from that step and checking how often the final answer is correct (Section 2.1; Section 3.1). The paper argues this is fundamentally closer to a value estimate (“probability of eventual success”) than a deterministic step correctness label (Section 3.1.1).
BoN evaluation is widely used to measure PRM utility by picking the best of N candidate solutions scored by the PRM (Section 2.2). The paper shows BoN can be biased because (i) policy models often produce correct answers with flawed processes, and (ii) PRMs may tolerate these, inflating BoN scores (Sections 3.2.1–3.2.2; Figure 7; Table 5).
How the paper positions itself
It is an empirical “lessons learned” study backed by extensive comparisons across labeling methods (MC vs LLM-as-judge vs human), evaluation types (BoN vs PROCESSBENCH), and model scales (7B and 72B) (Sections 2–4; Tables 1–7; Figures 1–9).
It proposes a lightweight remedy—consensus filtering—and releases PRMs trained with that data pipeline (Section 4; Figure 2; Tables 6–7).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a training-and-evaluation pipeline for a model (PRM) that outputs a correctness score for each reasoning step in a math solution.
It solves “how do we build and evaluate step-verifiers for math reasoning” by (i) generating candidate solutions, (ii) assigning step labels using automated signals, (iii) training a step classifier, and (iv) evaluating both answer-selection performance (BoN) and step-error localization (PROCESSBENCH) (Sections 2–4).

3.2 Big-picture architecture (diagram in words)

Query & gold answer pool → a dataset of math problems with known correct final answers (Section 2.1 mentions “~500,000 queries with golden answers” for preliminary trials).
Solution generation (policy models) → generate multiple candidate solutions per query (Section 2.1: 6–8 responses by mixing Qwen2-Math-Instruct and Qwen2.5-Math-Instruct, sizes 7B/72B).
Step segmentation → split each solution into steps using delimiter "\n\n" (Section 2.1).
Step labeling
MC estimation branch: from each step, sample 8 completions and label step based on whether completions reach the correct final answer (Section 2.1).
LLM-as-a-judge branch: a large critic model verifies steps directly and stops at the first error (Section 3.1.2; Appendix C prompt template).
Consensus filtering → keep only examples where MC and judge agree on error-step locations (Section 3.1.3; Figure 2; Section 4.1).
PRM training → initialize from Qwen2.5-Math-(7B/72B)-Instruct, replace LM head with a scalar head, and train with step-end supervision (Section 2.1; Section 4.1).
Evaluation
Response-level: BoN selection (prm@8) using aggregation of step scores (product/min/last) (Section 2.2; Section 3.2.4; Tables 6, 13–14).
Step-level: PROCESSBENCH error localization F1 and special subsets (Sections 2.2–2.3; Tables 2, 4, 7; Figure 7; Table 5).

3.3 Roadmap for the deep dive

I will explain (1) what PRMs are vs value models, because that distinction drives the MC-labeling critique (Section 3.1.1).
Then (2) how MC estimation labels steps and why it injects noise (Section 2.1; Section 3.1.2).
Then (3) how BoN is computed from step scores and why it can mis-measure process verification (Section 2.2; Sections 3.2.1–3.2.4).
Then (4) the proposed consensus filtering mechanism and the final training recipe (Section 3.1.3; Section 4.1; Figure 2).
Finally (5) the evaluation results across BoN and PROCESSBENCH showing the trade-offs and improvements (Tables 6–7; Figures 1–4, 7).

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems-and-methodology paper whose core idea is that training data and evaluation choices can cause PRMs to drift away from step verification, and that combining MC-based labels with LLM-as-a-judge via agreement filtering yields higher-quality step supervision (Sections 3–4).

3.4.1 What is a PRM, and why “PRM vs value model” matters here

A Process Reward Model (PRM) is a model that scores the correctness of the current reasoning step in a multi-step solution, with the intended meaning that “this step is logically/mathematically valid given prior steps” (Section 3.1.1).
A value model instead estimates the probability of eventually reaching a correct final answer from the current partial solution (Section 3.1.1).
The paper’s key technical concern is that MC estimation labels steps by looking at future outcomes (whether completions reach the right final answer), which aligns more with a value estimate than a deterministic step correctness signal (Section 3.1.1). This mismatch can hurt step-error localization generalization, even if it helps answer selection under certain scoring schemes (Sections 3.1.2–3.1.5; Figure 9).
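The distinction can be written compactly. The notation below is introduced here for illustration and is not taken from the paper: $x$ is the problem, $s_1, \dots, s_i$ are the steps so far, $a^{*}$ is the golden answer, and $a^{(k)}$ is the final answer of the $k$-th of $K$ completions sampled from step $s_i$.

$$
r_i = \mathbb{1}\left[\, s_i \text{ is valid given } x, s_1, \dots, s_{i-1} \,\right] \qquad \text{(PRM target: step correctness)}
$$

$$
V(x, s_{1:i}) = \Pr\left(\text{final answer} = a^{*} \mid x, s_{1:i}\right) \qquad \text{(value: future success under the completion model)}
$$

$$
\hat{V}(x, s_{1:i}) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left[\, a^{(k)} = a^{*} \,\right] \qquad \text{(MC estimate)}
$$

Any MC label is a function of $\hat{V}$, so it estimates the value rather than $r_i$. The two can disagree in both directions: $r_i = 0$ with high $V$ when completions recover from an error, and $r_i = 1$ with low $V$ when a valid step sits on a hard problem.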

3.4.2 MC estimation data synthesis: how labels are constructed

Based on the preliminary training setup (Section 2.1) and later scaling experiments (Section 3.1.4; Section 4.1), the MC labeling pipeline works as follows:

For each query with a golden answer, generate multiple candidate solutions (Section 2.1: 6–8 responses).
Split each solution into steps using "\n\n" as the step delimiter (Section 2.1).
For each step s_i, run K = 8 independent completions starting from that step using a completion model (Section 2.1; Section 3.1.4 also uses 8 completions).
Convert those outcomes into labels:
Hard label (binary): the step is labeled correct if any of the 8 completions leads to the correct final answer; otherwise incorrect (Sections 2.1, 3.1.4). The threshold study in Section 3.1.4 (Figure 5) indicates that this threshold of 0 works best.
Soft label (continuous): label is the fraction of completions that reach the correct final answer (Section 2.1; Section 3.1.4).
The paper removes all steps after the first step labeled incorrect (“eliminated all steps subsequent to those labeled as incorrect”) to avoid training on downstream steps that depend on an earlier error (Section 2.1).
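The labeling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names and the `outcomes` layout are hypothetical, the completion sampling itself is assumed to have happened already, and applying the truncation to both label types via the hard label is an assumption.

```python
from typing import List, Tuple


def split_steps(solution: str) -> List[str]:
    """Split a solution into steps on the double-newline delimiter."""
    return [s.strip() for s in solution.split("\n\n") if s.strip()]


def mc_labels(
    outcomes: List[List[bool]], threshold: float = 0.0
) -> Tuple[List[float], List[int]]:
    """Turn per-step completion outcomes into soft and hard labels.

    outcomes[i][k] is True if the k-th completion sampled from step i
    reached the golden answer (hypothetical input layout).
    Steps after the first incorrect step are dropped.
    """
    soft: List[float] = []
    hard: List[int] = []
    for step_outcomes in outcomes:
        frac = sum(step_outcomes) / len(step_outcomes)  # soft label
        label = 1 if frac > threshold else 0  # threshold 0: any success counts
        soft.append(frac)
        hard.append(label)
        if label == 0:
            break  # keep the first incorrect step, discard everything after it
    return soft, hard


outcomes = [
    [True] * 8,            # step 0: all 8 completions succeed
    [True] + [False] * 7,  # step 1: a single completion succeeds
    [False] * 8,           # step 2: none succeed, first incorrect step
    [False] * 8,           # step 3: dropped because it follows step 2
]
soft, hard = mc_labels(outcomes)
# soft == [1.0, 0.125, 0.0], hard == [1, 1, 0]
```

Step 1 shows why threshold 0 is permissive: one lucky completion out of eight is enough for a positive hard label, which is exactly the "repair" case that makes MC labels noisy.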

Why this can be noisy (the paper’s mechanism-level explanation):
- A completion model can sometimes “repair” an incorrect step and still arrive at a correct final answer, causing MC labeling to incorrectly mark the bad step as “correct” (abstract and Section 1; Sections 3.1.2–3.1.3).
- Conversely, a correct step can be followed by a completion that makes a later mistake, causing MC labeling to wrongly penalize the correct step (Sections 3.1.2–3.1.3).

3.4.3 LLM-as-a-judge step labeling: what it does differently

The judge approach uses a strong LLM critic (the paper uses Qwen2.5-72B-Instruct as judge in Section 3.1.2, and Qwen2.5-Instruct-72B in Section 4.1) to verify each step directly rather than inferring correctness from future completion success.
The prompt template (Appendix C) instructs the judge to:
Analyze each paragraph/step sequentially,
Stop at the first detected error,
Output “Correct/Incorrect” after providing analysis up to the failure point.
Mechanistically, this is closer to the PRM’s target behavior: determine whether a step is valid “right now,” not whether it can be salvaged later (Sections 3.1.1–3.1.2).

3.4.4 Consensus filtering: the paper’s main training-data mechanism

The proposed mechanism is a data filtering rule that keeps only training instances where two independent labelers agree on where the error occurs.

Inputs:
Step-wise labels from MC estimation (8 completions per step) (Sections 2.1, 4.1).
Step-wise labels from LLM-as-a-judge critic (Appendix C; Sections 3.1.2, 4.1).
Filtering rule:
Keep an instance only when both methods agree on the error reasoning step locations (Section 3.1.3; reiterated in Section 4.1).
Empirical effect on data volume:
Figure 2 reports that after consensus filtering, “only approximately 40% of the data are preserved” (Figure 2; Section 3.1.3).
Intended effect:
Reduce MC label noise while still leveraging MC to expand data and the judge to improve correctness of error localization (Section 3.1.3; Section 3.1.5; Section 4.1).
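The filtering rule can be sketched as follows. This is one reading of "agree on the error step locations", assumed here for illustration: an instance is kept when both labelers report the same first erroneous step, or both report no error. The dictionary keys and function names are hypothetical.

```python
from typing import Dict, List, Optional


def first_error_index(step_labels: List[int]) -> Optional[int]:
    """Index of the first step labeled 0 (incorrect), or None if all are 1."""
    for i, label in enumerate(step_labels):
        if label == 0:
            return i
    return None


def consensus_filter(examples: List[Dict]) -> List[Dict]:
    """Keep examples whose MC and judge labels agree on the first error."""
    return [
        ex
        for ex in examples
        if first_error_index(ex["mc"]) == first_error_index(ex["judge"])
    ]


examples = [
    {"id": "a", "mc": [1, 1, 0], "judge": [1, 1, 0]},  # same error step: keep
    {"id": "b", "mc": [1, 1, 1], "judge": [1, 0]},     # MC missed an error: drop
    {"id": "c", "mc": [1, 0], "judge": [1, 1, 1]},     # MC over-penalized: drop
    {"id": "d", "mc": [1, 1], "judge": [1, 1]},        # both all-correct: keep
]
kept_ids = [ex["id"] for ex in consensus_filter(examples)]
# kept_ids == ["a", "d"]
```

Examples "b" and "c" correspond to the two MC noise modes described in Section 3.4.2 of this summary: a repaired error and a penalized correct step.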

3.4.5 PRM model training: architecture and objective (as specified)

Initialization:
PRMs are initialized from Qwen2.5-Math-7B-Instruct or Qwen2.5-Math-72B-Instruct (Section 2.1; Section 4.1).
Head replacement:
The language modeling head is replaced with a scalar-value head consisting of two linear layers (Section 2.1).
Supervision point:
Loss is computed “on the last tokens of each step” (Section 2.1) / “tokens at the end of each step” (Section 4.1).
Loss:
Cross-entropy loss for the binary hard-label setting (Sections 2.1, 4.1).
Mean squared error loss is used for regression with soft labels in preliminary trials (Section 2.1).
Missing training hyperparameters:
The excerpt does not specify optimizer type, learning rate, batch size, training tokens, context window, number of layers/hidden size/heads, or hardware/compute budget, so those details cannot be reported here.
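To make the supervision point and the two objectives concrete, here is a small pure-Python sketch. It assumes the scalar head's output at each step-end token has already been mapped to a probability in [0, 1]; the helper names are hypothetical and the real training code (head architecture details, batching, tokenization) is not specified in the excerpt.

```python
import math
from typing import List


def step_end_positions(step_token_counts: List[int]) -> List[int]:
    """Index of the last token of each step, given each step's token count."""
    positions: List[int] = []
    total = 0
    for count in step_token_counts:
        total += count
        positions.append(total - 1)
    return positions


def cross_entropy(probs: List[float], hard_labels: List[int]) -> float:
    """Mean binary cross-entropy over step-end predictions (hard labels)."""
    eps = 1e-12  # guard against log(0)
    losses = [
        -(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps)))
        for p, y in zip(probs, hard_labels)
    ]
    return sum(losses) / len(losses)


def mse(preds: List[float], soft_labels: List[float]) -> float:
    """Mean squared error over step-end predictions (soft labels)."""
    return sum((p - y) ** 2 for p, y in zip(preds, soft_labels)) / len(preds)


# Three steps of 3, 2 and 4 tokens are supervised at token indices 2, 4 and 8.
positions = step_end_positions([3, 2, 4])
hard_loss = cross_entropy([0.9, 0.8, 0.2], [1, 1, 0])
soft_loss = mse([0.9, 0.8, 0.2], [1.0, 0.125, 0.0])
```

Only the positions returned by `step_end_positions` would contribute to the loss; all other token positions are ignored in this sketch.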

3.4.6 How BoN scoring uses PRMs, and why aggregation matters (with a micro-example)

BoN setup:
- Sample N = 8 candidate solutions from a policy model (Section 2.2).
- Score each candidate with a PRM, then select the highest-scored candidate (Section 2.2).

Response-level score from step scores:
- The paper describes a common method: multiply per-step scores (Section 2.2; also discussed in Sections 3.2.3–3.2.4 and Tables 13–14).
- It also analyzes alternative aggregations: product, minimum, and last step score (Section 3.2.4; Figure 9; Tables 13–14).

Worked micro-example (illustrative of the paper’s scoring rules, not new data):
- Suppose a solution has 3 steps and a PRM outputs step correctness probabilities p1 = 0.9, p2 = 0.8, p3 = 0.2.
- Aggregations:
  - product score = 0.9 * 0.8 * 0.2 = 0.144
  - min score = min(0.9, 0.8, 0.2) = 0.2
  - last score = p3 = 0.2
- The paper argues the “right” aggregation depends on what the step score means:
  - If step scores are “probability this step is correct” (judge/human-style), product/min are sensible (Section 3.2.4).
  - If step scores are MC-style “probability of eventually getting the right answer from here,” then multiplying across steps is conceptually wrong because those probabilities are not independent; in that case, the last score is more coherent and empirically works better for MC-trained PRMs (Section 3.2.4; Figure 9).
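The three aggregations and the BoN selection step can be sketched as below. The function names and the candidate scores are made up for illustration; only the product/min/last rules come from the text above.

```python
import math
from typing import List


def aggregate(step_scores: List[float], how: str) -> float:
    """Collapse per-step PRM scores into one response-level score."""
    if how == "product":
        return math.prod(step_scores)
    if how == "min":
        return min(step_scores)
    if how == "last":
        return step_scores[-1]
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: List[List[float]], how: str) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    scores = [aggregate(c, how) for c in candidates]
    return max(range(len(scores)), key=lambda i: scores[i])


candidates = [
    [0.9, 0.8, 0.2],    # the micro-example above
    [0.6, 0.6, 0.7],    # uniformly moderate steps
    [0.99, 0.2, 0.95],  # weak middle step, strong final step
]
picks = {how: best_of_n(candidates, how) for how in ("product", "min", "last")}
# picks == {"product": 1, "min": 1, "last": 2}
```

Product and min both penalize the weak middle step of the third candidate, while last ignores it, so the choice of aggregation changes which response BoN selects.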

3.4.7 Step-level evaluation via PROCESSBENCH: what is being measured

PROCESSBENCH evaluates whether a model can identify the first erroneous step in a solution, or decide that all steps are correct (Section 2.2; Table 2/Table 4/Table 7).
The paper reports per-dataset error/correct metrics and aggregated F1 (Tables 2, 4, 7).
It additionally creates a targeted subset: cases where the final answer is correct but the process contains an error, to test true process-verification ability (Section 3.2.2; Figure 7; Table 5).
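A sketch of how a PRM can be scored on this task follows. Two points are assumptions rather than statements from this summary: the 0.5 decision threshold, and the use of -1 to mean "no erroneous step". The F1 here is the harmonic mean of accuracy on erroneous samples and accuracy on all-correct samples, which is how PROCESSBENCH's error/correct/F1 columns are understood to relate.

```python
from typing import List


def predict_first_error(step_scores: List[float], threshold: float = 0.5) -> int:
    """First step whose PRM score falls below the threshold, or -1 if none."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1


def processbench_f1(preds: List[int], golds: List[int]) -> float:
    """Harmonic mean of accuracy on erroneous and on all-correct samples."""
    err = [p == g for p, g in zip(preds, golds) if g != -1]
    cor = [p == g for p, g in zip(preds, golds) if g == -1]
    if not err or not cor:
        raise ValueError("need at least one erroneous and one correct sample")
    acc_err = sum(err) / len(err)
    acc_cor = sum(cor) / len(cor)
    if acc_err + acc_cor == 0:
        return 0.0
    return 2 * acc_err * acc_cor / (acc_err + acc_cor)


pred = predict_first_error([0.9, 0.8, 0.2])  # step index 2 is flagged
f1 = processbench_f1(preds=[1, -1, 2, -1], golds=[1, -1, 0, 2])
# error accuracy is 1/3, correct accuracy is 1.0, so f1 == 0.5
```

Because the harmonic mean is dominated by the weaker of the two accuracies, a model that labels every solution as fully correct scores 0 here, however well it does on the correct subset.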

4. Key Insights and Innovations

MC estimation labels do not reliably teach step correctness (and can behave like value learning).
Novelty: The paper separates “step correctness” (PRM) from “future success probability” (value) and argues MC estimation conflates them (Section 3.1.1).
Significance: This explains why MC-trained PRMs can look OK under BoN yet fail at step-error localization (Tables 3–4; Figure 7).
BoN-only evaluation is systematically biased for PRMs.
Novelty: The paper identifies three mechanisms:
- Policy models produce “answer-correct but process-wrong” solutions (Section 3.2.1; Figure 6).
- PRMs often tolerate these, inflating BoN (Section 3.2.2; Figure 7; Table 5).
- Optimization for BoN can shift PRMs toward scoring final answers more than intermediate steps (Section 3.2.3; Figure 8).
Significance: It motivates evaluating PRMs with step-level benchmarks in addition to BoN (Section 3.2.5).
Consensus filtering (MC ∩ judge agreement) as a simple, data-efficient quality control.
Novelty: Rather than choosing MC or judge labels, the method keeps only instances where both agree on error locations (Section 3.1.3; Section 4.1).
Significance: Figure 2 shows roughly 40% retention yet near-judge-level PROCESSBENCH performance:
- PROCESSBENCH mean F1: MC (40.1) vs judge (46.5) vs consensus (46.3) (Figure 2).
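This supports a “quality-over-quantity” reading of PRM training data: a filtered set of roughly 40% of the examples matches the judge-labeled full set on step-error localization.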
Hard labels beat soft labels once noise is reduced.
Novelty: The paper ties the soft-vs-hard outcome to noise levels and the deterministic nature of step correctness (Section 3.1.4).
Evidence/significance: After consensus filtering, hard labels substantially outperform soft labels on both BoN and PROCESSBENCH (Figures 3–4).
Scoring strategy must match label semantics (product/min vs last).
Novelty: The paper links aggregation choice to whether scores represent step correctness vs future success (Section 3.2.4).
Evidence: Figure 9 shows that for MC-trained PRMs, last scoring outperforms product/min, while the opposite trend holds for judge/human-labeled PRMs.

5. Experimental Analysis

5.1 Evaluation methodology

Policy model and sampling
- BoN evaluation samples N = 8 candidate responses from Qwen2.5-Math-7B-Instruct (Section 2.2; Table 6).
- Candidates are scored using step-score aggregation; the default described is the product of step scores (Section 2.2), with extensive comparisons to min and last later (Section 3.2.4; Tables 13–14).

BoN tasks / datasets
- The excerpt lists: GSM8K, MATH, Minerva Math, GaoKao 2023 En, OlympiadBench, College Math, and MMLU STEM (Section 2.2; Table 6).

Step-level benchmark
- PROCESSBENCH measures first-error localization and reports error/correct/F1 (Section 2.2; Tables 2, 4, 7).

Baselines compared
- PRMs: Math-Shepherd-PRM-7B, RLHFlow-PRM-*, Skywork-PRM-*, EurusPRM-*, plus internal fine-tunes Qwen2.5-Math-7B-PRM800K and Qwen2.5-Math-7B-Math-Shepherd (Section 4.2; Tables 6–7).
- ORMs: Qwen2.5-Math-RM-72B (Section 4.2; Table 6; Table 7 includes ORM-as-step scorer via decomposition).
- LLM-as-judge baselines for PROCESSBENCH include proprietary and open-source critics (Section 4.2; Table 7).
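The results that follow compare PRM-based selection (prm@8) against majority voting (maj@8) and the any-correct upper bound (pass@8). A minimal per-problem sketch of these three metrics is given below, using their usual definitions; the function names, the string answers, and the scores are hypothetical.

```python
from collections import Counter
from typing import List


def pass_at_k(answers: List[str], gold: str) -> bool:
    """True if any sampled answer matches the golden answer (upper bound)."""
    return gold in answers


def maj_at_k(answers: List[str], gold: str) -> bool:
    """True if the most frequent sampled answer matches the golden answer."""
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer == gold


def prm_at_k(answers: List[str], scores: List[float], gold: str) -> bool:
    """True if the answer of the highest-scored response matches the gold."""
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best] == gold


# Eight sampled final answers for one problem and their response-level scores.
answers = ["15", "12", "15", "7", "15", "9", "12", "15"]
scores = [0.3, 0.8, 0.2, 0.1, 0.4, 0.1, 0.6, 0.3]
gold = "12"
result = (pass_at_k(answers, gold), maj_at_k(answers, gold), prm_at_k(answers, scores, gold))
# result == (True, False, True)
```

In this made-up case the majority answer is wrong but a correct response exists, which is the situation where a good verifier can beat maj@8. Averaging each boolean over a benchmark gives the reported accuracy.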

5.2 Main quantitative results (with specific numbers)

(A) Preliminary trials: MC-trained PRMs fail on PROCESSBENCH despite lots of data

From Tables 1–2 (Section 2.3):

BoN (Avg. over 7 tasks):
maj@8: 66.2
Qwen2.5-Math-7B-PRM800K: 64.9
Qwen2.5-Math-7B-PRM-MC-hard: 65.5
Qwen2.5-Math-7B-PRM-MC-soft: 64.4
(Table 1 shows none exceeds maj@8.)
PROCESSBENCH (Avg. F1):
PRM800K-trained PRM: 56.5
MC-hard: 40.2
MC-soft: 40.2
(Table 2)

This establishes the paper’s core empirical complaint: MC labeling can look “not terrible” in BoN but collapses in step-error localization.

(B) Comparing label sources: MC vs judge vs human shows BoN/PROCESSBENCH inversion

From Tables 3–4 (Section 3.1.2):

BoN average accuracy (Best-of-8):
MC (our data, 860k): 65.9 (best among the compared label sources)
LLM-as-judge (860k): 65.3
Human annotation PRM800K (264k): 64.9
(Table 3)
PROCESSBENCH Avg. F1:
Human annotation PRM800K: 56.5 (best)
LLM-as-judge (860k): 46.5
MC (our data, 860k): 40.1
MC (Math-Shepherd, 440k): 28.9
(Table 4)

The paper highlights that the ranking of MC and human labels reverses depending on whether you measure answer selection (BoN) or step verification (PROCESSBENCH): MC leads on BoN, while human annotation leads on PROCESSBENCH (Section 3.1.2; also visualized in Figure 7).

(C) Consensus filtering improves data efficiency and PROCESSBENCH

Figure 2 (Section 3.1.3) reports:

PROCESSBENCH mean F1: MC (40.1) vs LLM-as-judge (46.5) vs Consensus Filtering (~350k) (46.3)

So the filtered dataset (about 40% retained) nearly matches the judge-trained model on step-error localization.

(D) Hard vs soft labels: hard becomes much better after filtering

Figures 3–4 (Section 3.1.4) show:

Before filtering (3M), BoN is about the same for soft vs hard (Figure 3 shows both around 65.4).
After filtering (1.5M), hard labels jump to 67.2 on BoN while soft labels stay around 65.4 (Figure 3).
On PROCESSBENCH, after filtering, hard labels reach 66.5 vs soft labels 49.3 (Figure 4).

This supports the claim that once label noise is controlled, deterministic hard labels are superior for step correctness (Section 3.1.4).

(E) BoN bias evidence: “answer-correct but process-wrong” becomes common as problems get harder

Figure 6 (Section 3.2.1) reports process error rates among correct-answer responses sampled from the policy model:

GSM8K 5.1%, MATH 11.9%, OlympiadBench 27.4%, Omni-MATH 43.4%

This is the empirical basis for BoN/PRM-objective misalignment: BoN rewards correct answers, but PRMs are supposed to penalize flawed steps even if the final answer is right (Section 3.2.1).

(F) Existing PRMs struggle on “answer-correct but process-wrong” detection

Table 5 (Section 3.2.2) evaluates detection accuracy on PROCESSBENCH cases with correct answers but erroneous processes. Many open PRMs are below 50% average accuracy, while the paper’s released PRMs are higher:

Qwen2.5-Math-PRM-7B: Avg 53.9; Qwen2.5-Math-PRM-72B: Avg 58.1 (Table 5)

(G) Final released PRMs are strongest among compared open PRMs on both BoN and PROCESSBENCH

BoN (policy: Qwen2.5-Math-7B-Instruct), Table 6:
maj@8: 66.2
Qwen2.5-Math-PRM-7B: 67.6
Qwen2.5-Math-PRM-72B: 69.3
Qwen2.5-Math-RM-72B (ORM): 68.9
The 7B PRM beats maj@8 and other ~7B PRMs on average; the 72B PRM slightly exceeds the 72B ORM on average, with notable gains on some tasks like Minerva Math and MMLU STEM as described in Section 4.3.
PROCESSBENCH, Table 7:
Qwen2.5-Math-PRM-7B: Avg F1 73.5
Qwen2.5-Math-PRM-72B: Avg F1 78.3
Next strongest open PRM baseline shown is much lower (e.g., Qwen2.5-Math-7B-PRM800K: 56.5) (Table 7).

5.3 Do experiments support the claims?

Support for “MC estimation is inferior for step verification”: Strongly supported by PROCESSBENCH comparisons (Tables 2, 4) and by the extracted subset analysis (Figure 7; Table 5).
Support for “BoN alone is biased”: Supported by direct measurement of answer-correct/process-wrong prevalence (Figure 6), by the inversion trend across BoN vs extracted PROCESSBENCH (Figure 7), and by evidence that minimum step scores concentrate on the final step (Figure 8).
Support for “consensus filtering helps”: Supported by Figure 2 (near-judge-level F1 with ~40% data) and by the later hard/soft filtered results (Figures 3–4).

5.4 Ablations / robustness checks / extra analyses present

Label threshold sweep for MC hard-labeling (Figure 5): performance degrades as threshold increases; best is threshold 0 (treat any successful completion as positive) (Section 3.1.4).
Aggregation strategy comparison (last vs product vs min) across PRMs and data sources (Section 3.2.4; Figure 9; Tables 13–14).
Larger N (Best-of-64) and additional tasks (Appendix B.4; Table 10).
Policy model scaling: BoN with Qwen2.5-Math-72B-Instruct as policy (Appendix B.1; Table 9).
Chinese benchmarks (Appendix B.3; Tables 15–16).
PRM-guided greedy search comparison to ORM BoN (Appendix A; Table 8).

6. Limitations and Trade-offs

Incomplete reporting of core training hyperparameters
The provided content does not specify optimizer, learning rate schedule, batch size, training length/tokens, context window, model architecture dimensions, or hardware/compute. This makes exact reproduction harder from the excerpt alone (Sections 2.1 and 4.1 describe objectives and labeling, but not those hyperparameters).
Dependence on a strong LLM judge
The consensus filtering pipeline relies on a large critic model (Qwen2.5-72B-Instruct / Qwen2.5-Instruct-72B) for step verification (Section 3.1.2; Section 4.1). This introduces compute/cost and potential judge bias as a practical constraint.
Data efficiency vs coverage trade-off
Consensus filtering keeps only ~40% of examples (Figure 2), which improves label quality but may discard hard-but-informative disagreements (Section 3.1.3). The paper shows the trade-off is beneficial empirically, but the failure modes of discarded examples are not deeply analyzed in the excerpt.
BoN upper bound gap remains
The paper explicitly notes a “considerable performance gap” to pass@8 (Limitation paragraph near the end of Section 6). For example, Table 6 shows pass@8 average 74.7 vs Qwen2.5-Math-PRM-72B 69.3, a gap of 5.4 points.
PRM usage in RL is not explored
The paper lists “best practices for utilizing PRMs in reinforcement learning remain unexplored” (Limitation paragraph in Section 6).
Human annotation integration is underexplored
The paper notes that efficiently using existing high-quality human annotation is still largely underexplored and suggests weak supervision to expand datasets (Limitation paragraph in Section 6).

7. Implications and Future Directions

How this changes the landscape (within the paper’s scope)
It reframes PRM development as jointly a data-labeling and evaluation-design problem: strong BoN numbers can hide weak process verification (Sections 3.1–3.2; Figure 7).
It provides empirical evidence that step-level benchmarks like PROCESSBENCH are necessary to prevent PRMs from becoming outcome-oriented (Sections 3.2.3–3.2.5; Figure 8; Table 7).
Follow-up research directions suggested by the paper
Better ways to use PRMs in reinforcement learning (explicitly listed as future work in the Limitation paragraph in Section 6).
Better use of existing human-annotated data, potentially via weakly supervised expansion (Limitation paragraph in Section 6).
Improved search strategies combining PRMs and value models: Appendix A argues greedy step-wise choice can be locally optimal but globally wrong, and suggests DFS/backtracking or combining reward (step correctness) and value (future success probability) (Appendix A; Table 8).
Practical applications / downstream use cases
Answer selection: Using PRMs to choose among multiple sampled solutions (BoN), while being cautious that BoN can mis-measure process quality (Section 3.2).
Error localization / tutoring / debugging: Step-level detection of the first incorrect step, which is directly measured by PROCESSBENCH and where the released PRMs are strongest among compared open PRMs (Table 7).
Search guidance: PRM-guided greedy search is explored, though improvements over ORM BoN are small in their reported setup (Appendix A; Table 8).
Repro/Integration Guidance (based on the provided paper)
If your goal is true step verification, prefer:
- Training data closer to deterministic step correctness (human annotation or judge labeling), or the paper’s consensus filtering approach to clean MC-expanded data (Section 3.1.3; Figure 2; Section 4.1).
- Evaluation that includes step-level metrics like PROCESSBENCH, not only BoN (Section 3.2.5; Table 7).
If you must use MC estimation for labeling, the paper’s guidance is:
- Use hard labels rather than soft labels after noise control (Section 3.1.4; Figures 3–4).
- Use threshold 0 for hard labels (a step is positive if any completion succeeds) (Section 3.1.4; Figure 5).
- In BoN, consider last-score aggregation for MC-trained PRMs because it better matches MC semantics (Section 3.2.4; Figure 9).
If you train PRMs with judge/human-style step correctness semantics, product or min aggregation is more appropriate for BoN in their findings (Section 3.2.4; Figure 9; Tables 13–14).