The Lessons of Developing Process Reward Models in Mathematical Reasoning
ArXiv: 2501.07301
🎯 Pitch
The paper diagnoses why common Monte Carlo (MC) estimation for step-level labeling and Best-of-N (BoN) evaluation mislead Process Reward Models (PRMs), showing that MC labels behave like value estimates and cause PRMs to tolerate outcome-correct but process-flawed solutions. It introduces a simple consensus filtering pipeline that combines MC estimation with LLM-as-a-judge and demonstrates substantially improved step-error localization on PROCESSBENCH while preserving strong BoN selection performance, making PRMs more reliable for process-aware verification and downstream selection/search.
1. Executive Summary (2-3 sentences)
This paper analyzes why many existing Process Reward Models (PRMs) for math reasoning underperform at their intended job (verifying intermediate reasoning steps) despite looking strong under the common Best-of-N (BoN) selection evaluation. It shows that popular Monte Carlo (MC) estimation-based step labeling can be noisy and can push PRMs toward outcome-like behavior, and it proposes a simple consensus filtering pipeline combining MC labeling with LLM-as-a-judge step verification that improves step-error localization on PROCESSBENCH while maintaining strong BoN results (Sections 3–4; Figures 2–4; Tables 6–7).
2. Context and Motivation
- Problem / gap addressed
  - Large Language Models (LLMs) can reach correct final answers while containing wrong or fabricated intermediate reasoning steps, which undermines trust in the reasoning process (Introduction, Section 1; Figure 6).
  - PRMs aim to score each step in a solution so that systems can detect where reasoning first goes wrong and use those signals for selection or search (Section 1; Section 2.2 "PROCESSBENCH").
- Why this matters
  - If an evaluation only checks final answers, models can be rewarded for "answer-correct but reasoning-wrong" behavior (Section 3.2.1; Figure 6).
  - In downstream usage (e.g., selecting among multiple sampled solutions), a PRM is supposed to favor solutions whose process is correct, not just the outcome (Sections 1, 3.2).
- Prior approaches and where they fall short
  - Human step annotation can produce high-quality supervision but is expensive (Section 1; Section 3.1.2 uses `PRM800K` as the human-annotated reference).
  - MC estimation labels a step by sampling multiple completions from that step and checking how often the final answer is correct (Section 2.1; Section 3.1). The paper argues this is fundamentally closer to a value estimate ("probability of eventual success") than a deterministic step correctness label (Section 3.1.1).
  - BoN evaluation is widely used to measure PRM utility by picking the best of N candidate solutions scored by the PRM (Section 2.2). The paper shows BoN can be biased because (i) policy models often produce correct answers with flawed processes, and (ii) PRMs may tolerate these, inflating BoN scores (Sections 3.2.1–3.2.2; Figure 7; Table 5).
- How the paper positions itself
  - It is an empirical "lessons learned" study backed by extensive comparisons across labeling methods (MC vs LLM-as-judge vs human), evaluation types (BoN vs PROCESSBENCH), and model scales (7B and 72B) (Sections 2–4; Tables 1–7; Figures 1–9).
  - It proposes a lightweight remedy, consensus filtering, and releases PRMs trained with that data pipeline (Section 4; Figure 2; Tables 6–7).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a training-and-evaluation pipeline for a model (PRM) that outputs a correctness score for each reasoning step in a math solution.
- It solves "how do we build and evaluate step-verifiers for math reasoning" by (i) generating candidate solutions, (ii) assigning step labels using automated signals, (iii) training a step classifier, and (iv) evaluating both answer-selection performance (BoN) and step-error localization (PROCESSBENCH) (Sections 2–4).
3.2 Big-picture architecture (diagram in words)
- Query & gold answer pool: a dataset of math problems with known correct final answers (Section 2.1 mentions "~500,000 queries with golden answers" for preliminary trials).
- Solution generation (policy models): generate multiple candidate solutions per query (Section 2.1: 6–8 responses by mixing `Qwen2-Math-Instruct` and `Qwen2.5-Math-Instruct`, sizes 7B/72B).
- Step segmentation: split each solution into steps using delimiter `"\n\n"` (Section 2.1).
- Step labeling
  - MC estimation branch: from each step, sample 8 completions and label step based on whether completions reach the correct final answer (Section 2.1).
  - LLM-as-a-judge branch: a large critic model verifies steps directly and stops at the first error (Section 3.1.2; Appendix C prompt template).
- Consensus filtering: keep only examples where MC and judge agree on error-step locations (Section 3.1.3; Figure 2; Section 4.1).
- PRM training: initialize from `Qwen2.5-Math-(7B/72B)-Instruct`, replace LM head with a scalar head, and train with step-end supervision (Section 2.1; Section 4.1).
- Evaluation
  - Response-level: BoN selection (`prm@8`) using aggregation of step scores (product/min/last) (Section 2.2; Section 3.2.4; Tables 6, 13–14).
  - Step-level: PROCESSBENCH error localization F1 and special subsets (Sections 2.2–2.3; Tables 2, 4, 7; Figure 7; Table 5).
3.3 Roadmap for the deep dive
- I will explain (1) what PRMs are vs value models, because that distinction drives the MC-labeling critique (Section 3.1.1).
- Then (2) how MC estimation labels steps and why it injects noise (Section 2.1; Section 3.1.2).
- Then (3) how BoN is computed from step scores and why it can mis-measure process verification (Section 2.2; Sections 3.2.1–3.2.4).
- Then (4) the proposed consensus filtering mechanism and the final training recipe (Section 3.1.3; Section 4.1; Figure 2).
- Finally (5) the evaluation results across BoN and PROCESSBENCH showing the trade-offs and improvements (Tables 6–7; Figures 1–4, 7).
3.4 Detailed, sentence-based technical breakdown
This is an empirical systems-and-methodology paper whose core idea is that training data and evaluation choices can cause PRMs to drift away from step verification, and that combining MC-based labels with LLM-as-a-judge via agreement filtering yields higher-quality step supervision (Sections 3–4).
3.4.1 What is a PRM, and why "PRM vs value model" matters here
- A Process Reward Model (PRM) is a model that scores the correctness of the current reasoning step in a multi-step solution, with the intended meaning that "this step is logically/mathematically valid given prior steps" (Section 3.1.1).
- A value model instead estimates the probability of eventually reaching a correct final answer from the current partial solution (Section 3.1.1).
- The paper's key technical concern is that MC estimation labels steps by looking at future outcomes (whether completions reach the right final answer), which aligns more with a value estimate than a deterministic step correctness signal (Section 3.1.1). This mismatch can hurt step-error localization generalization, even if it helps answer selection under certain scoring schemes (Sections 3.1.2–3.1.5; Figure 9).
3.4.2 MC estimation data synthesis: how labels are constructed
Based on the preliminary training setup (Section 2.1) and later scaling experiments (Section 3.1.4; Section 4.1), the MC labeling pipeline works as follows:
- For each query with a golden answer, generate multiple candidate solutions (Section 2.1: 6–8 responses).
- Split each solution into steps using `"\n\n"` as the step delimiter (Section 2.1).
- For each step `s_i`, run `K = 8` independent completions starting from that step using a completion model (Section 2.1; Section 3.1.4 also uses 8 completions).
- Convert those outcomes into labels:
  - Hard label (binary): the step is labeled correct if any of the 8 completions leads to the correct final answer; otherwise incorrect (Section 2.1; Section 3.1.4; the threshold study in Figure 5 indicates threshold `0` is best).
  - Soft label (continuous): label is the fraction of completions that reach the correct final answer (Section 2.1; Section 3.1.4).
- The paper removes all steps after the first step labeled incorrect ("eliminated all steps subsequent to those labeled as incorrect") to avoid training on downstream steps that depend on an earlier error (Section 2.1).
Why this can be noisy (the paper's mechanism-level explanation):
- A completion model can sometimes "repair" an incorrect step and still arrive at a correct final answer, causing MC labeling to incorrectly mark the bad step as "correct" (Section 1 and abstract; Sections 3.1.2–3.1.3).
- Conversely, a correct step can be followed by a completion that makes a later mistake, causing MC labeling to wrongly penalize the correct step (same locations).
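The labeling rules above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `mc_step_labels`, the `outcomes_per_step` input layout, and the example rollouts are all hypothetical, and the rollout outcomes are assumed to have been computed already by comparing each completion's final answer with the golden answer.

```python
from typing import List, Tuple


def mc_step_labels(
    outcomes_per_step: List[List[bool]],
    threshold: float = 0.0,
) -> Tuple[List[float], List[int]]:
    """Turn per-step rollout outcomes into soft and hard MC labels.

    outcomes_per_step[i][k] is True when the k-th completion sampled
    from step i reached the golden final answer (hypothetical layout).
    """
    soft: List[float] = []
    hard: List[int] = []
    for outcomes in outcomes_per_step:
        if not outcomes:
            raise ValueError("each step needs at least one completion")
        # Soft label: fraction of completions that reach the correct answer.
        frac = sum(outcomes) / len(outcomes)
        # Hard label: positive when the fraction exceeds the threshold;
        # threshold 0 means "any successful completion counts".
        label = 1 if frac > threshold else 0
        soft.append(frac)
        hard.append(label)
        # Drop every step after the first one labeled incorrect.
        if label == 0:
            break
    return soft, hard


# Hypothetical solution with 4 steps and K = 8 completions per step.
rollouts = [
    [True] * 6 + [False] * 2,  # step 1: 6/8 succeed
    [True] * 1 + [False] * 7,  # step 2: 1/8 succeed
    [False] * 8,               # step 3: 0/8 succeed
    [True] * 8,                # step 4: dropped (follows the first error)
]
soft_labels, hard_labels = mc_step_labels(rollouts)
print(soft_labels)  # [0.75, 0.125, 0.0]
print(hard_labels)  # [1, 1, 0]
```

Step 2 shows the noise mechanism in miniature: a single lucky completion out of eight is enough to mark the step correct under threshold `0`, whether or not the step itself is valid.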
3.4.3 LLM-as-a-judge step labeling: what it does differently
- The judge approach uses a strong LLM critic (the paper uses `Qwen2.5-72B-Instruct` as judge in Section 3.1.2, and `Qwen2.5-Instruct-72B` in Section 4.1) to verify each step directly rather than inferring correctness from future completion success.
- The prompt template (Appendix C) instructs the judge to:
  - Analyze each paragraph/step sequentially,
  - Stop at the first detected error,
  - Output "Correct/Incorrect" after providing analysis up to the failure point.
- Mechanistically, this is closer to the PRM's target behavior: determine whether a step is valid "right now," not whether it can be salvaged later (Sections 3.1.1–3.1.2).
3.4.4 Consensus filtering: the paper's main training-data mechanism
The proposed mechanism is a data filtering rule that keeps only training instances where two independent labelers agree on where the error occurs.
- Inputs:
  - Step-wise labels from MC estimation (8 completions per step) (Sections 2.1, 4.1).
  - Step-wise labels from LLM-as-a-judge critic (Appendix C; Sections 3.1.2, 4.1).
- Filtering rule:
  - Keep an instance only when both methods agree on the location of the erroneous reasoning step (Section 3.1.3; reiterated in Section 4.1).
- Empirical effect on data volume:
  - Figure 2 reports that after consensus filtering, "only approximately 40% of the data are preserved" (Figure 2; Section 3.1.3).
- Intended effect:
  - Reduce MC label noise while still leveraging MC to expand data and the judge to improve correctness of error localization (Section 3.1.3; Section 3.1.5; Section 4.1).
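The filtering rule can be sketched as a comparison of first-error positions. The sketch below makes two assumptions that the text above does not settle: agreement is checked on the first erroneous step only, and instances where both labelers find no error are kept. The dictionary keys `mc_labels` and `judge_labels` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional


def first_error_index(step_labels: List[int]) -> Optional[int]:
    """Index of the first step labeled incorrect (0), or None if all are correct."""
    for i, label in enumerate(step_labels):
        if label == 0:
            return i
    return None


def consensus_filter(instances: List[Dict]) -> List[Dict]:
    """Keep instances whose MC and judge labels place the first error at the same step."""
    kept = []
    for inst in instances:
        mc_err = first_error_index(inst["mc_labels"])
        judge_err = first_error_index(inst["judge_labels"])
        # Assumption: "both find no error" (None == None) also counts as agreement.
        if mc_err == judge_err:
            kept.append(inst)
    return kept


data = [
    {"id": "a", "mc_labels": [1, 1, 0], "judge_labels": [1, 1, 0]},  # agree: step 3
    {"id": "b", "mc_labels": [1, 1, 1], "judge_labels": [1, 0]},     # MC missed the error
    {"id": "c", "mc_labels": [1, 1], "judge_labels": [1, 1]},        # agree: no error
]
kept_ids = [inst["id"] for inst in consensus_filter(data)]
print(kept_ids)  # ['a', 'c']
```

Instance "b" is the kind of case the filter is meant to remove: MC rollouts recovered from a step that the judge flags as wrong.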
3.4.5 PRM model training: architecture and objective (as specified)
- Initialization:
  - PRMs are initialized from `Qwen2.5-Math-7B-Instruct` or `Qwen2.5-Math-72B-Instruct` (Section 2.1; Section 4.1).
- Head replacement:
  - The language modeling head is replaced with a scalar-value head consisting of two linear layers (Section 2.1).
- Supervision point:
  - Loss is computed "on the last tokens of each step" (Section 2.1) / "tokens at the end of each step" (Section 4.1).
- Loss:
  - Cross-entropy loss for the binary hard-label setting (Sections 2.1, 4.1).
  - Mean squared error loss is used for regression with soft labels in preliminary trials (Section 2.1).
- Missing training hyperparameters:
  - The excerpt does not specify optimizer type, learning rate, batch size, training tokens, context window, number of layers/hidden size/heads, or hardware/compute budget. Based on the provided content, those details are not available to report precisely.
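For reference, the two objectives named in the list above can be written in their standard textbook forms. This is a sketch under assumptions rather than the paper's exact notation: $p_i$ denotes the PRM's predicted probability at the last token of step $i$, $y_i$ the step label, and $n$ the number of supervised steps; the $1/n$ averaging is a convention chosen here and is not specified in the excerpt.

```latex
% Hard labels: binary cross-entropy over step-end positions
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big], \qquad y_i \in \{0, 1\}

% Soft labels: mean squared error against the MC success fraction
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)^2, \qquad y_i \in [0, 1]
```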
3.4.6 How BoN scoring uses PRMs, and why aggregation matters (with a micro-example)
BoN setup:
- Sample N = 8 candidate solutions from a policy model (Section 2.2).
- Score each candidate with a PRM, then select the highest-scored candidate (Section 2.2).
Response-level score from step scores:
- The paper describes a common method: multiply per-step scores (Section 2.2; also discussed in Sections 3.2.3–3.2.4 and Tables 13–14).
- It also analyzes alternative aggregations: product, minimum, and last step score (Section 3.2.4; Figure 9; Tables 13–14).
Worked micro-example (illustrative of the paper's scoring rules, not new data):
- Suppose a solution has 3 steps and a PRM outputs step correctness probabilities:
  - Step 1: p1 = 0.9, Step 2: p2 = 0.8, Step 3: p3 = 0.2.
- Aggregations:
  - product score = 0.9 * 0.8 * 0.2 = 0.144.
  - min score = min(0.9, 0.8, 0.2) = 0.2.
  - last score = 0.2.
- The paper argues the "right" aggregation depends on what the step score means:
  - If step scores are "probability this step is correct" (judge/human-style), product/min are sensible (Section 3.2.4).
  - If step scores are MC-style "probability of eventually getting the right answer from here," then multiplying across steps is conceptually wrong because those probabilities are not independent; in that case, the last score is more coherent and empirically works better for MC-trained PRMs (Section 3.2.4; Figure 9).
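The three aggregations and the BoN selection step can be sketched as follows. The helper names `response_score` and `best_of_n` and the second candidate are hypothetical; the first candidate reuses the micro-example scores from above.

```python
import math
from typing import Callable, Dict, List

# Map each aggregation name to a function over a list of step scores.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "product": math.prod,
    "min": min,
    "last": lambda step_scores: step_scores[-1],
}


def response_score(step_scores: List[float], how: str = "product") -> float:
    """Collapse per-step PRM scores into one response-level score."""
    if not step_scores:
        raise ValueError("a response needs at least one scored step")
    return AGGREGATORS[how](step_scores)


def best_of_n(candidates: List[List[float]], how: str = "product") -> int:
    """Return the index of the candidate with the highest aggregated score."""
    return max(range(len(candidates)), key=lambda i: response_score(candidates[i], how))


# Micro-example from the text above.
scores = [0.9, 0.8, 0.2]
for name in AGGREGATORS:
    print(name, round(response_score(scores, name), 3))

# Two hypothetical candidates: the aggregation choice changes the winner.
candidates = [[0.9, 0.8, 0.2], [0.5, 0.5, 0.5]]
print(best_of_n(candidates, "product"))  # 0  (0.144 > 0.125)
print(best_of_n(candidates, "last"))     # 1  (0.5 > 0.2)
```

The second example shows why the choice is not cosmetic: the same PRM outputs select different responses depending on whether the scores are multiplied or only the final one is read.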
3.4.7 Step-level evaluation via PROCESSBENCH: what is being measured
- PROCESSBENCH evaluates whether a model can identify the first erroneous step in a solution, or decide that all steps are correct (Section 2.2; Tables 2, 4, 7).
- The paper reports per-dataset error/correct metrics and aggregated F1 (Tables 2, 4, 7).
- It additionally creates a targeted subset: cases where the final answer is correct but the process contains an error, to test true process-verification ability (Section 3.2.2; Figure 7; Table 5).
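A rough sketch of how a PRM's step scores could be turned into a first-error prediction and scored is given below. Several details are assumptions not stated in the text above: the 0.5 decision threshold, the use of `-1` to mean "all steps correct", and F1 computed as the harmonic mean of accuracy on erroneous samples and accuracy on all-correct samples. The function names and sample data are hypothetical.

```python
from typing import List, Tuple


def predict_first_error(step_scores: List[float], threshold: float = 0.5) -> int:
    """Index of the first step scored below the threshold, or -1 if none is."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1


def localization_f1(predictions: List[int], golds: List[int]) -> Tuple[float, float, float]:
    """Accuracy on erroneous samples, accuracy on all-correct samples, harmonic mean."""
    err = [p == g for p, g in zip(predictions, golds) if g != -1]
    cor = [p == g for p, g in zip(predictions, golds) if g == -1]
    acc_err = sum(err) / len(err) if err else 0.0
    acc_cor = sum(cor) / len(cor) if cor else 0.0
    denom = acc_err + acc_cor
    f1 = 2 * acc_err * acc_cor / denom if denom else 0.0
    return acc_err, acc_cor, f1


# Hypothetical PRM scores for four solutions and their gold first-error indices.
scored = [[0.9, 0.3, 0.8], [0.9, 0.9], [0.2, 0.9], [0.9, 0.4]]
gold = [1, -1, 1, -1]
preds = [predict_first_error(s) for s in scored]
print(preds)                         # [1, -1, 0, 1]
print(localization_f1(preds, gold))  # (0.5, 0.5, 0.5)
```

A harmonic mean penalizes a model that flags errors everywhere (high error accuracy, low correct accuracy) or nowhere, which is the behavior a step verifier should avoid.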
4. Key Insights and Innovations
- MC estimation labels do not reliably teach step correctness (and can behave like value learning).
  - Novelty: The paper separates "step correctness" (PRM) from "future success probability" (value) and argues MC estimation conflates them (Section 3.1.1).
  - Significance: This explains why MC-trained PRMs can look OK under BoN yet fail at step-error localization (Tables 3–4; Figure 7).
- BoN-only evaluation is systematically biased for PRMs.
  - Novelty: The paper identifies three mechanisms:
    - Policy models produce "answer-correct but process-wrong" solutions (Section 3.2.1; Figure 6).
    - PRMs often tolerate these, inflating BoN (Section 3.2.2; Figure 7; Table 5).
    - Optimization for BoN can shift PRMs toward scoring final answers more than intermediate steps (Section 3.2.3; Figure 8).
  - Significance: It motivates evaluating PRMs with step-level benchmarks in addition to BoN (Section 3.2.5).
- Consensus filtering (MC ∩ judge agreement) as a simple, data-efficient quality control.
  - Novelty: Rather than choosing MC or judge labels, the method keeps only instances where both agree on error locations (Section 3.1.3; Section 4.1).
  - Significance: Figure 2 shows roughly 40% retention yet near-judge-level PROCESSBENCH performance:
    - PROCESSBENCH mean F1: MC (40.1) vs judge (46.5) vs consensus (46.3) (Figure 2).
  - This is a fundamental "quality-over-quantity" finding for PRM data.
- Hard labels beat soft labels once noise is reduced.
  - Novelty: The paper ties the soft-vs-hard outcome to noise levels and the deterministic nature of step correctness (Section 3.1.4).
  - Evidence/significance: After consensus filtering, hard labels substantially outperform soft labels on both BoN and PROCESSBENCH (Figures 3–4).
- Scoring strategy must match label semantics (product/min vs last).
  - Novelty: The paper links aggregation choice to whether scores represent step correctness vs future success (Section 3.2.4).
  - Evidence: Figure 9 shows that for MC-trained PRMs, `last` scoring outperforms `product`/`min`, while the opposite trend holds for judge/human-labeled PRMs.
5. Experimental Analysis
5.1 Evaluation methodology
Policy model and sampling
- BoN evaluation samples N = 8 candidate responses from Qwen2.5-Math-7B-Instruct (Section 2.2; Table 6).
- Candidates are scored using step-score aggregation; the default described is the product of step scores (Section 2.2), with extensive comparisons to min and last later (Section 3.2.4; Tables 13–14).
BoN tasks / datasets
- The excerpt lists: GSM8K, MATH, Minerva Math, GaoKao 2023 En, OlympiadBench, College Math, and MMLU STEM (Section 2.2; Table 6).
Step-level benchmark
- PROCESSBENCH measures first-error localization and reports error/correct/F1 (Section 2.2; Tables 2, 4, 7).
Baselines compared
- PRMs: Math-Shepherd-PRM-7B, RLHFlow-PRM-*, Skywork-PRM-*, EurusPRM-*, plus internal fine-tunes Qwen2.5-Math-7B-PRM800K and Qwen2.5-Math-7B-Math-Shepherd (Section 4.2; Tables 6–7).
- ORMs: Qwen2.5-Math-RM-72B (Section 4.2; Table 6; Table 7 includes ORM-as-step scorer via decomposition).
- LLM-as-judge baselines for PROCESSBENCH include proprietary and open-source critics (Section 4.2; Table 7).
5.2 Main quantitative results (with specific numbers)
(A) Preliminary trials: MC-trained PRMs fail on PROCESSBENCH despite lots of data
From Tables 1–2 (Section 2.3):
- BoN (Avg. over 7 tasks):
  - `maj@8`: 66.2
  - `Qwen2.5-Math-7B-PRM800K`: 64.9
  - `Qwen2.5-Math-7B-PRM-MC-hard`: 65.5
  - `Qwen2.5-Math-7B-PRM-MC-soft`: 64.4
  (Table 1 shows none exceeds `maj@8`.)
- PROCESSBENCH (Avg. F1):
  - PRM800K-trained PRM: 56.5
  - MC-hard: 40.2
  - MC-soft: 40.2
  (Table 2)
This establishes the paper's core empirical complaint: MC labeling can look "not terrible" in BoN but collapses in step-error localization.
(B) Comparing label sources: MC vs judge vs human shows BoN/PROCESSBENCH inversion
From Tables 3–4 (Section 3.1.2):
- BoN average accuracy (Best-of-8):
  - MC (our data, 860k): 65.9 (best among the four)
  - LLM-as-judge (860k): 65.3
  - Human annotation PRM800K (264k): 64.9
  (Table 3)
- PROCESSBENCH Avg. F1:
  - Human annotation PRM800K: 56.5 (best)
  - LLM-as-judge (860k): 46.5
  - MC (our data, 860k): 40.1
  - MC (Math-Shepherd, 440k): 28.9
  (Table 4)
The paper highlights that MC vs human can reverse ordering depending on whether you measure answer-selection (BoN) or step verification (PROCESSBENCH) (Section 3.1.2; also visualized in Figure 7).
(C) Consensus filtering improves data efficiency and PROCESSBENCH
Figure 2 (Section 3.1.3) reports:
PROCESSBENCH mean F1: MC (40.1) vs LLM-as-judge (46.5) vs Consensus Filtering (~350k) (46.3)
So the filtered dataset (about 40% retained) nearly matches the judge-trained model on step-error localization.
(D) Hard vs soft labels: hard becomes much better after filtering
Figures 3–4 (Section 3.1.4) show:
- Before filtering (3M), BoN is about the same for soft vs hard (Figure 3 shows both around 65.4).
- After filtering (1.5M), hard labels jump to 67.2 on BoN while soft labels stay around 65.4 (Figure 3).
- On PROCESSBENCH, after filtering, hard labels reach 66.5 vs soft labels 49.3 (Figure 4).
This supports the claim that once label noise is controlled, deterministic hard labels are superior for step correctness (Section 3.1.4).
(E) BoN bias evidence: "answer-correct but process-wrong" becomes common as problems get harder
Figure 6 (Section 3.2.1) reports process error rates among correct-answer responses sampled from the policy model:
GSM8K 5.1%, MATH 11.9%, OlympiadBench 27.4%, Omni-MATH 43.4%
This is the empirical basis for BoN/PRM-objective misalignment: BoN rewards correct answers, but PRMs are supposed to penalize flawed steps even if the final answer is right (Section 3.2.1).
(F) Existing PRMs struggle on "answer-correct but process-wrong" detection
Table 5 (Section 3.2.2) evaluates detection accuracy on PROCESSBENCH cases with correct answers but erroneous processes. Many open PRMs are below 50% average accuracy, while the paper's released PRMs are higher:
`Qwen2.5-Math-PRM-7B`: Avg 53.9; `Qwen2.5-Math-PRM-72B`: Avg 58.1 (Table 5)
(G) Final released PRMs are strongest among compared open PRMs on both BoN and PROCESSBENCH
- BoN (policy: Qwen2.5-Math-7B-Instruct), Table 6:
  - `maj@8`: 66.2
  - `Qwen2.5-Math-PRM-7B`: 67.6
  - `Qwen2.5-Math-PRM-72B`: 69.3
  - `Qwen2.5-Math-RM-72B` (ORM): 68.9
  The 7B PRM beats `maj@8` and other ~7B PRMs on average; the 72B PRM slightly exceeds the 72B ORM on average, with notable gains on some tasks like Minerva Math and MMLU STEM as described in Section 4.3.
- PROCESSBENCH, Table 7:
  - `Qwen2.5-Math-PRM-7B`: Avg F1 73.5
  - `Qwen2.5-Math-PRM-72B`: Avg F1 78.3
  - Next strongest open PRM baseline shown is much lower (e.g., `Qwen2.5-Math-7B-PRM800K`: 56.5) (Table 7).
5.3 Do experiments support the claims?
- Support for "MC estimation is inferior for step verification": Strongly supported by PROCESSBENCH comparisons (Tables 2, 4) and by the extracted subset analysis (Figure 7; Table 5).
- Support for "BoN alone is biased": Supported by direct measurement of answer-correct/process-wrong prevalence (Figure 6), by the inversion trend across BoN vs extracted PROCESSBENCH (Figure 7), and by evidence of final-step minimum-score concentration (Figure 8).
- Support for "consensus filtering helps": Supported by Figure 2 (near-judge-level F1 with ~40% data) and by the later hard/soft filtered results (Figures 3–4).
5.4 Ablations / robustness checks / extra analyses present
- Label threshold sweep for MC hard-labeling (Figure 5): performance degrades as threshold increases; best is threshold `0` (treat any successful completion as positive) (Section 3.1.4).
- Aggregation strategy comparison (`last` vs `product` vs `min`) across PRMs and data sources (Section 3.2.4; Figure 9; Tables 13–14).
- Larger N (Best-of-64) and additional tasks (Appendix B.4; Table 10).
- Policy model scaling: BoN with `Qwen2.5-Math-72B-Instruct` as policy (Appendix B.1; Table 9).
- Chinese benchmarks (Appendix B.3; Tables 15–16).
- PRM-guided greedy search comparison to ORM BoN (Appendix A; Table 8).
6. Limitations and Trade-offs
- Incomplete reporting of core training hyperparameters
  - The provided content does not specify optimizer, learning rate schedule, batch size, training length/tokens, context window, model architecture dimensions, or hardware/compute. This makes exact reproduction harder from the excerpt alone (Sections 2.1 and 4.1 describe objectives and labeling, but not those hyperparameters).
- Dependence on a strong LLM judge
  - The consensus filtering pipeline relies on a large critic model (`Qwen2.5-72B-Instruct` / `Qwen2.5-Instruct-72B`) for step verification (Section 3.1.2; Section 4.1). This introduces compute/cost and potential judge bias as a practical constraint.
- Data efficiency vs coverage trade-off
  - Consensus filtering keeps only ~40% of examples (Figure 2), which improves label quality but may discard hard-but-informative disagreements (Section 3.1.3). The paper shows the trade-off is beneficial empirically, but the failure modes of discarded examples are not deeply analyzed in the excerpt.
- BoN upper bound gap remains
  - The paper explicitly notes a "considerable performance gap" to `pass@8` (Limitation paragraph near the end of Section 6). For example, Table 6 shows `pass@8` average 74.7 vs `Qwen2.5-Math-PRM-72B` 69.3.
- PRM usage in RL is not explored
  - The paper lists "best practices for utilizing PRMs in reinforcement learning remain unexplored" (Limitation paragraph in Section 6).
- Human annotation integration is underexplored
  - The paper notes that efficiently using existing high-quality human annotation is still largely underexplored and suggests weak supervision to expand datasets (Limitation paragraph in Section 6).
7. Implications and Future Directions
- How this changes the landscape (within the paper's scope)
  - It reframes PRM development as jointly a data-labeling and evaluation-design problem: strong BoN numbers can hide weak process verification (Sections 3.1–3.2; Figure 7).
  - It provides empirical evidence that step-level benchmarks like PROCESSBENCH are necessary to prevent PRMs from becoming outcome-oriented (Sections 3.2.3–3.2.5; Figure 8; Table 7).
- Follow-up research directions suggested by the paper
  - Better ways to use PRMs in reinforcement learning (explicitly listed as future work in the Limitation paragraph in Section 6).
  - Better use of existing human-annotated data, potentially via weakly supervised expansion (Limitation paragraph in Section 6).
  - Improved search strategies combining PRMs and value models: Appendix A argues greedy step-wise choice can be locally optimal but globally wrong, and suggests DFS/backtracking or combining reward (step correctness) and value (future success probability) (Appendix A; Table 8).
- Practical applications / downstream use cases
  - Answer selection: Using PRMs to choose among multiple sampled solutions (BoN), while being cautious that BoN can mis-measure process quality (Section 3.2).
  - Error localization / tutoring / debugging: Step-level detection of the first incorrect step, which is directly measured by PROCESSBENCH and where the released PRMs are strongest among compared open PRMs (Table 7).
  - Search guidance: PRM-guided greedy search is explored, though improvements over ORM BoN are small in their reported setup (Appendix A; Table 8).
- Repro/Integration Guidance (based on the provided paper)
  - If your goal is true step verification, prefer:
    - Training data closer to deterministic step correctness (human annotation or judge labeling), or
    - The paper's consensus filtering approach to clean MC-expanded data (Section 3.1.3; Figure 2; Section 4.1).
    - Evaluation that includes step-level metrics like PROCESSBENCH, not only BoN (Section 3.2.5; Table 7).
  - If you must use MC estimation for labeling, the paper's guidance is:
    - Use hard labels rather than soft labels after noise control (Section 3.1.4; Figures 3–4).
    - Use threshold `0` for hard labels (a step is positive if any completion succeeds) (Section 3.1.4; Figure 5).
    - In BoN, consider `last`-score aggregation for MC-trained PRMs because it better matches MC semantics (Section 3.2.4; Figure 9).
  - If you train PRMs with judge/human-style step correctness semantics, `product` or `min` aggregation is more appropriate for BoN in their findings (Section 3.2.4; Figure 9; Tables 13–14).