Your Group-Relative Advantage Is Biased¶

🎯 Pitch¶

This paper proves that commonly used group-relative advantage estimators in RL from verifier rewards are intrinsically biased—systematically underestimating advantages on hard prompts and overestimating them on easy prompts—and characterizes this effect theoretically and probabilistically. It introduces HA-DW, a lightweight history-aware adaptive reweighting scheme that uses a cross-batch difficulty anchor to correct the bias, and demonstrates consistent improvements across GRPO-style algorithms on multiple mathematical reasoning benchmarks, improving exploration of hard cases and overall training robustness.

1. Executive Summary¶

This paper analyzes group-relative RL methods for Reinforcement Learning from Verifier Rewards (RLVR)—notably GRPO-style algorithms that estimate advantages by subtracting a within-prompt group-average reward instead of learning a critic. It proves that once you condition on the “non-degenerate” groups that actually produce gradients (neither all-correct nor all-incorrect), the standard group-relative advantage estimator is inherently biased: it underestimates learning signal on hard prompts and overestimates it on easy prompts (Theorem 1, Section 2.2). To mitigate this, it proposes History-Aware Adaptive Difficulty Weighting (HA-DW), a plug-in reweighting scheme using a cross-batch “difficulty anchor,” and shows consistent accuracy gains on five math reasoning benchmarks when integrated into GRPO variants (Table 1, Section 5.1).

2. Context and Motivation¶

Problem/gap addressed
RLVR post-training often uses verifier rewards, frequently binary pass/fail signals on reasoning tasks (Section 2.2).
Group-relative algorithms (e.g., GRPO, GSPO, DAPO) avoid a learned critic by:
1. Sampling a small group of rollouts per prompt.
2. Using the group’s average reward as a baseline.
3. Updating the policy using within-group relative advantages (Eq. (2), Section 2.1).
Despite widespread adoption, the paper argues that the statistical properties of this group-relative advantage estimator in RLVR settings have not been characterized in detail (the Introduction describes them as “poorly understood”).
Why it matters
In RL, the advantage is the core learning signal that determines which sampled behaviors get reinforced.
If advantage estimates are systematically skewed by prompt difficulty, the algorithm can:
- Under-learn from hard prompts (insufficient exploration where it’s most needed).
- Over-exploit easy prompts (wasting updates reinforcing already-solved behavior).
This can harm both stability and generalization, especially under practical small rollout counts like \(G=8\) (Section 2.2 and discussion around computational constraints).
Prior approaches and shortcomings (as positioned here)
GRPO and variants (GSPO, DAPO, etc.) differ mainly in how they apply clipping/ratios and whether they operate token-level or sequence-level (Appendix B), but they all rely on the same core group-relative baseline idea.
A straightforward mitigation is “use more rollouts,” but that is computationally expensive and can hit memory limits (Table 3 notes rollout=32 is out-of-memory for their setup).
The paper positions its contribution as exposing an overlooked statistical bias intrinsic to group-relative estimation under the non-degenerate sampling regime, and proposing a lightweight correction that works under fixed rollout budgets.

3. Technical Approach¶

3.1 Reader orientation (approachable technical breakdown)¶

The system is a plug-in modification to group-relative policy optimization for RLVR that reweights advantages based on estimated prompt difficulty and training history.
It solves the problem of biased learning signals in group-relative RL by tracking an evolving notion of model capability across batches and using it to scale per-sample advantages before the policy update.

3.2 Big-picture architecture (diagram in words)¶

Prompt sampling: sample a prompt \(x_t \sim D\).
Group rollout generation: sample \(G\) responses \(y_{t,1},\dots,y_{t,G} \sim \pi_{\theta_t}(\cdot\mid x_t)\).
Verifier reward: compute rewards \(r_{t,i}\) (often binary) for each response.
Group-relative advantage: compute baseline \(\hat p_t=\frac{1}{G}\sum_i r_{t,i}\) and advantages \(\hat A_{t,i}=r_{t,i}-\hat p_t\) (Eq. (2)).
HA-DW module:
Update a cross-batch capability belief \(C_t\) from recent training batches (Eq. (13)–(17)).
Compute a history-based difficulty signal \(\mathrm{diff}^{his}_t=\hat p_t - C_t^{-}\) (Eq. (18)).
Produce a weight \(\Phi_{t,i}\) to scale \(\hat A_{t,i}\) (Eq. (21)–(22)).
Policy optimization: run GRPO/GSPO/DAPO objective, but multiply advantages by \(\Phi_{t,i}\) (e.g., Eq. (33), (37), (39)).

(Figure 3 depicts this two-phase “history anchor + difficulty reweighting” pipeline.)

3.3 Roadmap for the deep dive¶

Explain the baseline group-relative objective and what “advantage” means here (Eq. (1)–(4)).
Show the core bias mechanism created by conditioning on non-degenerate groups (Section 2.2; Theorems 1–2).
Describe the history anchor \(C_t\): what signal it tracks and how it updates (Eq. (12)–(17)).
Describe HA-DW weighting: how \(\Phi_{t,i}\) is computed and how it changes updates (Eq. (18)–(22)).
Summarize the theoretical guarantee that reweighting can reduce bias (Lemma 1, Theorem 3).
Connect to practical instantiations inside GRPO/GSPO/DAPO (Appendix B) and the reported experiments (Section 5).

3.4 Detailed, sentence-based technical breakdown¶

This is an algorithmic + theoretical analysis paper: it (i) proves a bias property of a widely-used estimator in group-based RLVR and (ii) proposes a correction method (HA-DW) with supporting theory and experiments.

3.4.1 Core RLVR / Group-PO setup and notation¶

At training step \(t\), the method samples a prompt \(x_t\) from a dataset distribution \(D\) (Section 2.1).
It generates \(G\) independent responses from the current policy: [ y_{t,i}\sim \pi_{\theta_t}(\cdot\mid x_t),\quad i\in\{1,\dots,G\}. ]
Each response receives a scalar reward \(r_{t,i}\), often binary in verifier-based reasoning tasks: [ r_{t,i}\in\{0,1\}. ]
The general group-relative policy optimization objective is written as (Eq. (1)): [ J_{\text{group}}(\theta)=\frac{1}{G}\sum_{i=1}^G \psi\!\left(\frac{\pi_\theta(y_{t,i}\mid x_t)}{\pi_{\theta_{\text{old}}}(y_{t,i}\mid x_t)}\right)\; \phi(\hat A_{t,i}), ] where:
\(\pi_{\theta_{\text{old}}}\) is the behavior policy that generated the rollouts and forms the denominator of the importance ratio,
\(\psi(\cdot)\) is a transform on the ratio (identity/clipping/log etc.),
\(\phi(\cdot)\) is a transform on advantage (kept general to cover variants).
The group-relative advantage estimator is (Eq. (2)): [ \hat A_{t,i}=r_{t,i}-\hat p_t,\qquad \hat p_t=\frac{1}{G}\sum_{i=1}^G r_{t,i}. ] Intuition: \(\hat p_t\) is a within-prompt baseline; a response is “good” if it beats the group average.
The paper defines the expected reward for prompt \(x_t\) as (Eq. (3)): [ p_t=\mathbb{E}_{y_t\sim \pi_{\theta_t}(\cdot\mid x_t)}[r(y_t)] =\Pr(r(y_t)=1\mid x_t,\pi_{\theta_t}). ]
It defines the expected advantage as (Eq. (4)): [ A_{t,i}=r_{t,i}-p_t. ] This is the “true” baseline-subtracted signal you would get if you knew \(p_t\) exactly.
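The two quantities in Eq. (2) and Eq. (4) can be sketched for a single prompt as follows. This is a minimal illustration rather than the paper's code: the function names are hypothetical, and `true_p` stands for \(p_t\), which is unknown in real training and is supplied here only to contrast the two baselines.

```python
from typing import List, Tuple


def group_relative_advantages(rewards: List[float]) -> Tuple[float, List[float]]:
    """Return (p_hat, advantages) for one prompt's group of rollouts, as in Eq. (2)."""
    if not rewards:
        raise ValueError("need at least one rollout")
    p_hat = sum(rewards) / len(rewards)  # within-prompt baseline
    return p_hat, [r - p_hat for r in rewards]


def true_advantages(rewards: List[float], true_p: float) -> List[float]:
    """Advantages against the exact expected reward p_t, as in Eq. (4).

    ASSUMPTION: true_p is an oracle value; it is not observable during training.
    """
    return [r - true_p for r in rewards]


# One correct rollout out of G = 8.
rewards = [1, 0, 0, 0, 0, 0, 0, 0]
p_hat, adv = group_relative_advantages(rewards)
print(p_hat, adv[0], adv[1])  # prints: 0.125 0.875 -0.125
```

By construction the group-relative advantages sum to zero within a group, which is why a group with identical rewards contributes no signal.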

3.4.2 Where the bias comes from: conditioning on non-degenerate groups¶

Under the binary reward modeling assumption, the paper treats each reward as: [ r_{t,i}\sim \mathrm{Bernoulli}(p_t)\quad\text{(Eq. (5))}. ]
Let the total number of correct responses in the group be: [ R=\sum_{i=1}^G r_{t,i},\qquad \hat p_t=\frac{R}{G}. ]
Many GRPO-style implementations yield zero learning signal if all responses are correct or all are incorrect, because then \(\hat A_{t,i}=0\) for all \(i\).
All incorrect: \(R=0\Rightarrow \hat p_t=0\Rightarrow \hat A_{t,i}=0\).
All correct: \(R=G\Rightarrow \hat p_t=1\Rightarrow \hat A_{t,i}=0\).
The paper formalizes focusing on the effective update regime by conditioning on the non-degenerate event (Eq. (6)): [ S:=\{1\le R\le G-1\}. ] The stated motivation is that conditioning on \(S\) isolates the samples that actually drive learning in these group-relative algorithms.
Key result (Theorem 1, Section 2.2): for any response \(i\), conditioning on \(S\) induces a systematic bias: [ \mathbb{E}[\hat A_{t,i}\mid S] < A_{t,i}\quad\text{if } p_t<0.5, ] [ \mathbb{E}[\hat A_{t,i}\mid S] > A_{t,i}\quad\text{if } p_t>0.5, ] with equality only at \(p_t=0.5\).
Interpretation: when a prompt is “hard” for the current policy (\(p_t<0.5\)), the group-relative estimator tends to be too pessimistic; when a prompt is “easy” (\(p_t>0.5\)), it tends to be too optimistic.
Figure 2 visualizes how the bias magnitude grows as \(p_t\) moves away from \(0.5\), for different group sizes \(G\).
The paper further gives a distribution-level statement (Theorem 2, Eq. (7)–(8)) that computes the exact conditional probability of an estimation error larger than \(\epsilon\), again split by hard vs easy prompts. These expressions are sums over binomial probabilities, normalized by the probability of the non-degenerate event.
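It then specializes to practical rollout sizes \(2\le G\le 8\) and gives lower bounds on how often the estimator errs in the stated direction, underestimating the advantage on hard prompts and overestimating it on easy ones (Corollary 1, Eq. (9)), e.g. probabilities \(>0.63\) in the broad hard/easy regions and \(>0.78\) in more extreme regions. The sign of \(\hat A_{t,i}\) itself cannot flip under \(S\), because \(0<\hat p_t<1\) there.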
In the extreme difficulty regimes, the bias becomes deterministic (Corollary 3, Eq. (11)): [ \hat A_{t,i} < A_{t,i}\ \text{if}\ p_t<\frac{1}{G},\qquad \hat A_{t,i} > A_{t,i}\ \text{if}\ p_t>\frac{G-1}{G}. ]
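The conditional quantities discussed above can be evaluated exactly by summing binomial probabilities over \(R=1,\dots,G-1\). The sketch below assumes the Bernoulli model of Eq. (5); the function names are hypothetical and the code is not taken from the paper. It reports the baseline \(\hat p_t\) rather than \(\hat A_{t,i}\): since \(\hat A_{t,i}-A_{t,i}=p_t-\hat p_t\), an overestimated baseline is the same event as an underestimated advantage.

```python
from math import comb
from typing import Tuple


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def conditional_baseline_stats(p: float, G: int) -> Tuple[float, float, float]:
    """Return (E[p_hat | S], P(p_hat > p | S), P(p_hat < p | S)) for S = {1 <= R <= G-1}."""
    if G < 2 or not 0.0 < p < 1.0:
        raise ValueError("need G >= 2 and 0 < p < 1")
    counts = range(1, G)  # non-degenerate outcomes only
    pmf = [binom_pmf(k, G, p) for k in counts]
    prob_S = sum(pmf)
    mean = sum((k / G) * w for k, w in zip(counts, pmf)) / prob_S
    over = sum(w for k, w in zip(counts, pmf) if k / G > p) / prob_S
    under = sum(w for k, w in zip(counts, pmf) if k / G < p) / prob_S
    return mean, over, under


for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    mean, over, under = conditional_baseline_stats(p, G=8)
    print(f"p={p:.1f}  E[p_hat|S]-p={mean - p:+.4f}  "
          f"P(p_hat>p|S)={over:.3f}  P(p_hat<p|S)={under:.3f}")
```

For \(p_t<1/G\), every non-degenerate outcome has \(\hat p_t\ge 1/G>p_t\), so the overestimation probability of the baseline is exactly 1, which matches the deterministic regime of Corollary 3.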

A small worked micro-example (illustrative, derived from the paper’s definitions):
- Suppose \(G=8\) and for a hard prompt the true success probability is \(p_t=0.1\).
- A key learning event is getting exactly one correct rollout: \(R=1\), so \(\hat p_t=1/8=0.125\). Under the Bernoulli model this happens with probability \(8\cdot 0.1\cdot 0.9^7\approx 0.38\).
- For that correct rollout \(i\) with \(r_{t,i}=1\):
  - Group-relative advantage: \(\hat A_{t,i}=1-0.125=0.875\).
  - Expected advantage, which subtracts \(p_t\) instead: \(A_{t,i}=1-0.1=0.9\).
- The group-relative estimator is smaller in this case, matching the paper’s “hard prompts are underestimated” direction. The paper’s theorems make this systematic once you condition on \(S\), not just a property of one outcome.
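The same example can be pushed from one outcome to the full conditional expectation. The derivation below is a sketch that follows from the Bernoulli model in Eq. (5) and the definition of \(S\) in Eq. (6); it is not quoted from the paper, and \(\mathbf{1}_S\) (the indicator of \(S\)) is notation introduced here.

```latex
\begin{aligned}
\mathbb{E}[R\,\mathbf{1}_S]
  &= \mathbb{E}[R] - 0\cdot\Pr(R=0) - G\cdot\Pr(R=G)
   = G\,p_t - G\,p_t^{G},\\
\Pr(S) &= 1-(1-p_t)^{G}-p_t^{G},\\
\mathbb{E}[\hat p_t\mid S]
  &= \frac{\mathbb{E}[R\,\mathbf{1}_S]}{G\,\Pr(S)}
   = \frac{p_t-p_t^{G}}{1-(1-p_t)^{G}-p_t^{G}}.
\end{aligned}

% Plugging in G = 8 and p_t = 0.1:
\mathbb{E}[\hat p_t\mid S]
  = \frac{0.1-10^{-8}}{1-0.9^{8}-10^{-8}}
  \approx \frac{0.1}{0.5695}
  \approx 0.1756 \;>\; 0.1 = p_t .
```

So for this hard prompt the conditional baseline sits about 0.076 above \(p_t\), and the advantage of every rollout is underestimated by the same amount in expectation.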

3.4.3 HA-DW Part 1: an evolving difficulty anchor \(C_t\) (cross-batch history)¶

The paper introduces a cross-batch quantity intended to represent “solving capability” as a latent belief state \(C_t\) (Section 3.1).
For a training batch \(t\), let \(B_t\) be the total number of responses and \(K_t=\sum_{i=1}^{B_t} r_{t,i}\) the number of correct responses. The observed batch accuracy is (Eq. (12)): [ y_t=\frac{K_t}{B_t}. ]
The belief update is a convex combination (Eq. (13)): [ C_t^{+}=(1-\eta_t)C_t^{-}+\eta_t\, y_t,\qquad \eta_t\in[0,1]. ]
\(C_t^{-}\) is the prior belief before observing batch \(t\),
\(C_t^{+}\) is the posterior belief after incorporating \(y_t\).
The forgetting factor is adaptive:
Compute mean over the previous \(m\) batches (Eq. (14)): [ \bar C_t=\frac{1}{m}\sum_{j=1}^m C_{t-j}. ]
Compute standard deviation (Eq. (15)): [ \sigma_t=\sqrt{\frac{1}{m}\sum_{j=1}^m (C_{t-j}-\bar C_t)^2}. ]
Set (Eq. (16)): [ \eta_t=\eta\cdot \sigma_t, ] where \(\eta\) is a task-dependent hyperparameter.
The posterior becomes the next prior (Eq. (17)): [ C_t^{+}\rightarrow C_{t+1}^{-}. ]
Appendix F also gives a simplified “hard update” version averaging over the last \(h\) accuracies (Eq. (137)).
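A minimal sketch of the belief update in Eq. (12)–(17) is shown below. Several details are assumptions made for illustration and are labeled in the code: the class and attribute names are hypothetical, \(\eta_t\) is clipped to \([0,1]\) to respect the stated range, and a warm-up rule (running mean of batch accuracies until \(m\) beliefs exist) is added because the excerpt does not say how \(C_t\) is initialized.

```python
from collections import deque
from statistics import pstdev
from typing import Deque, List, Optional, Sequence


class CapabilityAnchor:
    """Cross-batch capability belief C_t with an adaptive forgetting factor."""

    def __init__(self, eta: float, window: int) -> None:
        if window < 1:
            raise ValueError("window (m) must be at least 1")
        self.eta = eta                                    # task-dependent eta in Eq. (16)
        self.window = window                              # history length m
        self.beliefs: Deque[float] = deque(maxlen=window)  # last m posteriors
        self.warmup_acc: List[float] = []                 # used during warm-up only
        self.prior: Optional[float] = None                # C_t^- (None before any batch)

    def update(self, batch_rewards: Sequence[float]) -> float:
        """Consume one batch of rewards and return the posterior C_t^+."""
        y_t = sum(batch_rewards) / len(batch_rewards)     # Eq. (12): K_t / B_t
        if len(self.beliefs) < self.window:
            # ASSUMPTION: warm-up rule, not specified in the source text.
            self.warmup_acc.append(y_t)
            posterior = sum(self.warmup_acc) / len(self.warmup_acc)
        else:
            sigma_t = pstdev(list(self.beliefs))          # Eq. (14)-(15), population std
            # ASSUMPTION: clip so that eta_t stays inside [0, 1].
            eta_t = min(1.0, max(0.0, self.eta * sigma_t))    # Eq. (16)
            posterior = (1.0 - eta_t) * self.prior + eta_t * y_t  # Eq. (13)
        self.beliefs.append(posterior)
        self.prior = posterior                            # Eq. (17): posterior -> next prior
        return posterior


anchor = CapabilityAnchor(eta=2.0, window=2)
for batch in ([1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]):
    print(anchor.update(batch))
```

Without some warm-up or floor on \(\eta_t\), a constant history gives \(\sigma_t=0\) and the belief would never move, which is why the sketch needs an explicit initialization choice.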

Intuition in plain language: \(C_t\) is a running estimate of “how good the model currently is,” smoothed across batches but allowed to adapt faster when training is unstable, as measured by the spread \(\sigma_t\) of the last \(m\) beliefs.

3.4.4 HA-DW Part 2: difficulty-adaptive reweighting \(\Phi_{t,i}\)¶

The paper defines a history-based prompt difficulty (Eq. (18)): [ \mathrm{diff}^{his}_t=\hat p_t - C_t^{-}. ]
If \(\hat p_t < C_t^{-}\), the prompt looks harder than expected given current capability.
If \(\hat p_t > C_t^{-}\), the prompt looks easier than expected.
It then defines a direction variable (Eq. (19)): [ D_{t,i}=-\mathrm{sgn}(\hat A_{t,i})\cdot \mathrm{sgn}(\mathrm{diff}^{his}_t). ] This ties the direction of reweighting to both:
whether the sampled response is above/below the group baseline (sign of \(\hat A_{t,i}\)),
whether the prompt is easier/harder than the anchor (sign of \(\mathrm{diff}^{his}_t\)).
It defines the magnitude of difficulty deviation (Eq. (20)): [ M_t=\left|\mathrm{diff}^{his}_t\right|. ]
The reweighting factor is exponential (Eq. (21)): [ \Phi_{t,i}=\lambda_{\text{scale}}\cdot \exp(D_{t,i}\cdot M_t), ] where \(\lambda_{\text{scale}}\) is a scaling constant.
The resulting objective multiplies the original advantage term by \(\Phi_{t,i}\) (Eq. (22)): [ L_{\text{HA-DW}}(\theta)=\frac{1}{G}\sum_{i=1}^G \psi\!\left(\frac{\pi_\theta(y_{t,i}\mid x_t)}{\pi_{\theta_{\text{old}}}(y_{t,i}\mid x_t)}\right)\; \phi(\hat A_{t,i})\;\Phi_{t,i}. ]

Stated effect: \(\Phi_{t,i}\) is designed to amplify learning signal on difficult prompts (where group-relative advantage is “conservative”) and suppress it on easy prompts (where it is “inflated”), improving the exploration–exploitation balance (Figure 3 narrative; Section 3.2).
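The weight computation in Eq. (18)–(21) can be sketched for one group as follows. The function names are hypothetical, `prior_belief` stands for \(C_t^{-}\), and the unnormalized advantage \(r_{t,i}-\hat p_t\) from Eq. (2) is used; this is an illustration under those assumptions, not the paper's implementation.

```python
import math
from typing import List, Sequence, Tuple


def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def hadw_weights(
    rewards: Sequence[float], prior_belief: float, lambda_scale: float
) -> Tuple[List[float], List[float]]:
    """Return (advantages, weights) for one prompt's group of rollouts."""
    p_hat = sum(rewards) / len(rewards)
    diff_his = p_hat - prior_belief          # Eq. (18): history-based difficulty
    magnitude = abs(diff_his)                # Eq. (20)
    advantages, weights = [], []
    for r in rewards:
        adv = r - p_hat                      # Eq. (2)
        direction = -sign(adv) * sign(diff_his)                          # Eq. (19)
        weights.append(lambda_scale * math.exp(direction * magnitude))   # Eq. (21)
        advantages.append(adv)
    return advantages, weights


# A prompt that looks harder than the anchor: p_hat = 0.125 < C_t^- = 0.5.
adv, phi = hadw_weights([1, 0, 0, 0, 0, 0, 0, 0], prior_belief=0.5, lambda_scale=1.0)
for a, w in zip(adv[:2], phi[:2]):
    print(f"advantage={a:+.3f}  weight={w:.3f}  weighted={a * w:+.3f}")
```

On this harder-than-expected prompt the correct rollout gets a weight above \(\lambda_{\text{scale}}\) and the incorrect ones a weight below it, so the positive signal is amplified and the negative signal is damped; the pattern reverses when \(\hat p_t>C_t^{-}\).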

3.4.5 How HA-DW plugs into GRPO / GSPO / DAPO (Appendix B)¶

The paper provides explicit substitutions where \(\hat A\) becomes \(\hat A\cdot \Phi\):

GRPO (token-level clipped objective):
Base GRPO objective (Eq. (30)) uses per-token ratios \(r_{t,i,\tau}(\theta)\) (Eq. (31)) and group-normalized advantages (Eq. (32)).
HA-DW version (Eq. (33)) multiplies the advantage by \(\Phi_{t,i}\) inside the clipped minimum.
GSPO (sequence-level clipped objective):
Base objective (Eq. (34)) uses sequence-level ratio \(r_{t,i}(\theta)\) (Eq. (35)) and group-normalized advantage (Eq. (36)).
HA-DW version (Eq. (37)) again uses \(\hat A_{t,i}\cdot \Phi_{t,i}\).
DAPO (token-level, different normalization/denominator):
Base objective (Eq. (38)); HA-DW version (Eq. (39)) multiplies advantage by \(\Phi_{t,i}\).
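The substitution described above can be illustrated with a PPO-style clipped term for a single sample. This is a simplified sketch: the function name is hypothetical, and group normalization of the advantage, token or sequence averaging, and any KL term are omitted, so it does not reproduce Eq. (33), (37), or (39) exactly. The default clip values follow the GRPO row of Table 8.

```python
def clipped_surrogate(
    ratio: float,
    advantage: float,
    weight: float = 1.0,
    clip_low: float = 0.2,
    clip_high: float = 0.2,
) -> float:
    """Clipped surrogate for one sample with the advantage scaled by an HA-DW weight.

    weight = 1.0 recovers the unweighted clipped term.
    """
    weighted_adv = advantage * weight  # the HA-DW substitution: A_hat -> A_hat * Phi
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * weighted_adv, clipped_ratio * weighted_adv)


# Same sample with and without reweighting.
print(clipped_surrogate(ratio=1.5, advantage=0.875))
print(clipped_surrogate(ratio=1.5, advantage=0.875, weight=2.0))
```

Because \(\Phi_{t,i}>0\) whenever \(\lambda_{\text{scale}}>0\), the weight never flips the sign of the advantage, so which side of the clip is active is unchanged; only the magnitude of the term is rescaled.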

3.4.6 Theoretical guidance: HA-DW can reduce bias (Lemma 1, Theorem 3)¶

The analysis in Section 4 starts by looking at “baseline rectification”: scale the empirical baseline \(\hat p_t\) by a factor \(c\) to form \(\tilde p_t=c\hat p_t\) (Lemma 1).
Under assumptions \(p_t\in[\Delta,1-\Delta]\) and with probability at least \(1-\delta\) (conditional on \(S\)), Lemma 1 constructs feasible bounds \((c_{\text{low}},c_{\text{high}})\) such that choosing \(c\) in this range yields: [ \mathbb{E}[\tilde p_t\mid S]\in (p_t-\epsilon,\ p_t+\epsilon). ] (Eq. (23)–(27) formalize \(\epsilon_\delta\), the interval \(I_t\), and \(c_{\text{low}},c_{\text{high}}\).)
The main statement (Theorem 3) provides a sufficient condition on \(\lambda_{\text{scale}}\) (Eq. (28)) under which HA-DW reduces the bias magnitude in expectation: [ \left|\mathbb{E}[\hat A_{t,i}\cdot \Phi_{t,i}\mid S]-A_{t,i}\right| < \left|\mathbb{E}[\hat A_{t,i}\mid S]-A_{t,i}\right|. ] (Eq. (29).)
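Interpretation: with an appropriate scaling choice, the reweighted estimator has provably smaller bias magnitude, i.e. its conditional expectation lies closer to the “true” advantage. Note that Eq. (29) bounds the absolute value of the expected error, not the expected absolute error.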
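To build intuition for the baseline rectification in Lemma 1, one can compute the oracle factor \(c^{*}=p_t/\mathbb{E}[\hat p_t\mid S]\) that would make \(c\,\hat p_t\) exactly unbiased under the Bernoulli model of Eq. (5). The sketch below is an illustration only: \(c^{*}\) requires the unknown \(p_t\), it is not the paper's \(c_{\text{low}}\) or \(c_{\text{high}}\), and the function names are hypothetical. It uses the closed form \(\mathbb{E}[\hat p_t\mid S]=(p_t-p_t^{G})/(1-(1-p_t)^{G}-p_t^{G})\), obtained by removing the \(R=0\) and \(R=G\) terms from the binomial mean.

```python
def conditional_mean_baseline(p: float, G: int) -> float:
    """E[p_hat | S] for R ~ Binomial(G, p) and S = {1 <= R <= G-1}."""
    if G < 2 or not 0.0 < p < 1.0:
        raise ValueError("need G >= 2 and 0 < p < 1")
    return (p - p**G) / (1.0 - (1.0 - p) ** G - p**G)


def oracle_rescale(p: float, G: int) -> float:
    """Factor c* with E[c* * p_hat | S] = p (ASSUMPTION: p is known, which it is not in training)."""
    return p / conditional_mean_baseline(p, G)


for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p:.1f}  c*={oracle_rescale(p, G=8):.4f}")
```

The factor is below 1 on hard prompts (the baseline must be shrunk), equal to 1 at \(p_t=0.5\), and above 1 on easy prompts, which is the direction of correction HA-DW approximates without access to \(p_t\).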

3.4.7 Training configuration and hyperparameters (what the paper provides)¶

From Section 5 and Appendix C / Table 8:

Hardware/framework: training is done in the VeRL framework on a single node with 8 × NVIDIA A100 GPUs (Section 5 “Setups”).
Batch/sequence limits:
Maximum prompt batch size: 1024.
Maximum response length: 4096 tokens (Appendix C; Table 8).
Rollouts per prompt: rollout.n = 8 in the main hyperparameter table (Table 8).
Optimization:
Optimizer: AdamW.
Learning rate: \(1\times 10^{-6}\).
Weight decay: 0.1.
Warmup steps: 10.
Gradient clip: 1.0.
Train batch size: 256; mini batch size: 16; micro batch size: 4 (Table 8).
Epochs:
GRPO/GSPO: 3 epochs.
DAPO: 9 epochs (Table 8).
Clipping parameters (method-specific) (Table 8):
GRPO: clip-high/low = 0.2 / 0.2.
GSPO: clip-high/low = 0.0004 / 0.0003.
DAPO: clip-high/low = 0.28 / 0.2 (and note DAPO uses decoupled clipping in its method description).

What is not specified in the provided content (so cannot be reported): model architecture hyperparameters such as number of layers, hidden size, attention heads, tokenizer details, or context window beyond the maximum prompt/response lengths. The paper names models (Qwen3-4B/8B, LLaMA-3.2-3B-Instruct) but does not provide their architectural configs in the excerpted text.

4. Key Insights and Innovations¶

1) Formal proof that group-relative advantage is biased under non-degenerate sampling (Theorem 1, Section 2.2).
Novelty: The paper isolates the fact that practical GRPO-style training effectively conditions on \(S=\{1\le R\le G-1\}\) because degenerate groups yield zero gradient, and shows this conditioning alone induces systematic bias.
Significance: It explains a concrete mechanism for why training may skew toward easy prompts and away from hard ones even if the algorithm seems “symmetric” at first glance.
2) Distribution-level characterization of error likelihood under finite \(G\) (Theorem 2, Eq. (7)–(8); Corollaries 1–3).
Novelty: Beyond expected bias, it gives exact binomial-sum probabilities of under/overestimation exceeding \(\epsilon\), and derives practically interpretable bounds for small rollout sizes \(2\le G\le 8\) (Eq. (9)).
Significance: This directly targets the RLVR regime where rollouts are small due to cost, and quantifies that the problematic behavior happens with high probability.
3) HA-DW: a plug-in reweighting scheme using cross-batch capability anchoring + difficulty scaling (Section 3; Eq. (12)–(22)).
Novelty: The combination of (i) a history-aware capability belief \(C_t\) and (ii) exponential, sign-conditioned difficulty weighting \(\Phi_{t,i}\) is the core algorithmic contribution.
Significance: It is designed to be compatible with multiple group-relative objectives (GRPO/GSPO/DAPO) as a simple multiplicative factor.
4) A sufficient-condition guarantee that reweighting can reduce expected bias (Theorem 3, Eq. (28)–(29)).
Novelty: Theorem 3 provides a formal statement that the proposed correction can improve the estimator (smaller absolute bias relative to the true advantage) given an admissible range for \(\lambda_{\text{scale}}\).
Significance: This supplies principled guidance for \(\lambda_{\text{scale}}\) selection, at least at the level of existence/range conditions.

5. Experimental Analysis¶

Evaluation methodology¶

Models (Section 5 “Setups”; Appendix C):
Qwen3-4B-Base
Qwen3-8B-Base
LLaMA-3.2-3B-Instruct
Training dataset: MATH training split, 7.5k questions (Appendix C).
Benchmarks (Section 5; Table 1):
MATH500
AIME25
AMC23
Minerva
OlympiadBench
Metric: accuracy; additionally, for small benchmarks they report avg@16 on AIME25 and AMC23 to reduce variance (Appendix C).
Baselines:
Base group-relative algorithms: GRPO, GSPO, DAPO.
Their variants with HA-DW: + HA-DW applied on top.

Main quantitative results (Table 1)¶

Across all three model families/scales, HA-DW improves the average score:

Qwen3-4B-Base:
GRPO: AVG 46.5 → +HA-DW: 48.7.
GSPO: AVG 47.1 → +HA-DW: 49.2.
DAPO: AVG 46.8 → +HA-DW: 49.5.
Qwen3-8B-Base:
GRPO: AVG 49.6 → +HA-DW: 52.5.
GSPO: AVG 50.2 → +HA-DW: 51.7.
DAPO: AVG 50.7 → +HA-DW: 53.4.
LLaMA-3.2-3B-Instruct:
GRPO: AVG 25.7 → +HA-DW: 27.1.
GSPO: AVG 24.9 → +HA-DW: 25.8.
DAPO: AVG 26.5 → +HA-DW: 28.1.

These are consistent gains rather than a single-task improvement.
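The size of these gains can be tabulated directly from the AVG values listed above. The snippet is plain arithmetic on the numbers as reported in this summary of Table 1; the variable names are illustrative.

```python
# (model, algorithm) -> (baseline AVG, +HA-DW AVG), copied from the list above.
table1_avg = {
    ("Qwen3-4B-Base", "GRPO"): (46.5, 48.7),
    ("Qwen3-4B-Base", "GSPO"): (47.1, 49.2),
    ("Qwen3-4B-Base", "DAPO"): (46.8, 49.5),
    ("Qwen3-8B-Base", "GRPO"): (49.6, 52.5),
    ("Qwen3-8B-Base", "GSPO"): (50.2, 51.7),
    ("Qwen3-8B-Base", "DAPO"): (50.7, 53.4),
    ("LLaMA-3.2-3B-Instruct", "GRPO"): (25.7, 27.1),
    ("LLaMA-3.2-3B-Instruct", "GSPO"): (24.9, 25.8),
    ("LLaMA-3.2-3B-Instruct", "DAPO"): (26.5, 28.1),
}

# Absolute gain in AVG points, rounded to the table's precision.
gains = {key: round(after - before, 1) for key, (before, after) in table1_avg.items()}
mean_gain = round(sum(gains.values()) / len(gains), 1)

for (model, algo), gain in gains.items():
    print(f"{model:24s} {algo:5s} +{gain:.1f}")
print(f"mean gain: +{mean_gain:.1f}  min: +{min(gains.values()):.1f}  max: +{max(gains.values()):.1f}")
```

All nine deltas are positive, ranging from +0.9 (LLaMA-3.2-3B-Instruct, GSPO) to +2.9 (Qwen3-8B-Base, GRPO), with a mean of +2.0 AVG points.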

The paper stratifies MATH500 into difficulty buckets (Easy: Level 1; Mid: Levels 2–3; Hard: Levels 4–5) and reports:
Similar performance on Easy/Mid.
A +3.4% improvement on Hard prompts for GRPO+HA-DW vs GRPO (Section 5.1; Figure 1(c) described in text).

Rollout budget comparison (Table 3)¶

For Qwen3-4B-Base on GRPO:

Rollout=8 (base GRPO; these five scores average to the 46.5 reported in Table 1):
MATH500 75.4
AIME25 19.6
AMC23 60.3
Minerva 33.8
OlympiadBench 43.5
Rollout=16 improves some benchmarks (e.g., MATH500 76.2, AMC23 61.6), but rollout=8 + HA-DW does better:
MATH500 78.0, AIME25 20.4, AMC23 63.4, Minerva 36.8, OlympiadBench 44.7.
The paper notes rollout=32 with GRPO is out-of-memory in their setup (Table 3 caption), reinforcing the compute-motivation for bias correction.
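As a consistency check, the per-benchmark scores quoted above for the two complete rows can be averaged and compared with the AVG column of Table 1. The rollout=16 row is left out because only two of its scores appear in this summary; variable names are illustrative.

```python
benchmarks = ["MATH500", "AIME25", "AMC23", "Minerva", "OlympiadBench"]

# Per-benchmark accuracy for Qwen3-4B-Base, as quoted above from Table 3.
rollout8_grpo = [75.4, 19.6, 60.3, 33.8, 43.5]
rollout8_hadw = [78.0, 20.4, 63.4, 36.8, 44.7]

avg_grpo = round(sum(rollout8_grpo) / len(rollout8_grpo), 1)
avg_hadw = round(sum(rollout8_hadw) / len(rollout8_hadw), 1)
per_benchmark_gain = {
    name: round(after - before, 1)
    for name, before, after in zip(benchmarks, rollout8_grpo, rollout8_hadw)
}

print(avg_grpo, avg_hadw)  # prints: 46.5 48.7
print(per_benchmark_gain)
```

Both averages agree with the GRPO and GRPO + HA-DW entries for Qwen3-4B-Base in Table 1, and the gain is positive on each of the five benchmarks.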

Ablations¶

Dynamic threshold / anchor vs fixed thresholds (Table 2) on Qwen3-4B-Base GRPO-based training:
Baseline AVG 46.5.
Fixed thresholds (0.4/0.5/0.6) improve to AVG 48.1 / 47.8 / 48.0.
Dynamic \(C_t\) achieves AVG 48.7, the best among these options.
This supports the claim that cross-batch history improves over static difficulty settings.
Scaling parameter \(\lambda_{\text{scale}}\) (Table 7) on Qwen3-4B-Base GRPO+HA-DW:
Best reported AVG values occur at \(\lambda_{\text{scale}}=1.3\) (AVG 48.7) and \(1.5\) (AVG 48.6), with worse performance at too-small or too-large values.

Training dynamics (Figure 4, described)¶

The paper reports that HA-DW variants:
Reach higher accuracy plateaus.
Achieve higher training reward.
Encourage longer response lengths over training (Figure 4 description).
These observations are used to argue HA-DW “boosts exploration on hard prompts and weakens exploitation on easy prompts.”

Do the experiments support the claims?¶

Supportive aspects
Improvements appear across:
- multiple algorithms (GRPO/GSPO/DAPO),
- multiple model scales/families,
- multiple benchmarks (Table 1).
The hard-prompt stratification result aligns with the theoretical narrative (Section 5.1, Figure 1(c) text).
The rollout comparison shows HA-DW can outperform simply increasing rollouts within feasible budgets (Table 3).
What is not fully validated in the provided excerpt
Direct measurement of the theoretical quantity \(p_t\) per prompt is difficult; the paper uses empirical proxying (Appendix E.1 describes comparing rollout=8 vs rollout=128 distributions) but the detailed plots/results are in the appendix figures and text.
The theorem-guided range condition for \(\lambda_{\text{scale}}\) (Eq. (28)) is not instantiated numerically in the main text; instead, they empirically sweep \(\lambda_{\text{scale}}\) (Table 7).

6. Limitations and Trade-offs¶

Dependence on the non-degenerate conditioning assumption
The bias theorems explicitly condition on \(S=\{1\le R\le G-1\}\) (Eq. (6)).
This matches the claim that degenerate groups yield zero gradients and are ignored/discarded, but the exact handling of degenerate groups can vary by implementation; the theory is tied to the “effective update regime” framing.
Reward modeling assumptions
The main analysis assumes binary rewards (Bernoulli) (Eq. (5)), motivated by verifier pass/fail settings.
The paper claims extension to continuous bounded rewards in Appendix D.5 (Theorem 4), but the core experimental results presented are on math-verifier-like tasks that plausibly match the binary regime.
Extra hyperparameters and potential tuning burden
HA-DW introduces parameters such as \(\lambda_{\text{scale}}\), \(\eta\), and the history window \(m\) (Eq. (14)–(16)), plus any choices implicit in how \(C_t\) is computed.
Table 7 shows performance varies with \(\lambda_{\text{scale}}\), implying tuning matters.
Risk of mis-weighting due to noisy difficulty estimates
\(\mathrm{diff}^{his}_t=\hat p_t-C_t^{-}\) relies on \(\hat p_t\), which itself is noisy under small \(G\).
The method attempts to stabilize via cross-batch \(C_t\), but the overall weighting may still be sensitive early in training or under distribution shift.
Scope: tailored to group-relative methods
The paper’s own “Limitations” section notes the approach is designed for group-relative methods and does not directly address other RLVR estimators or credit assignment schemes.
Missing reporting details (in the provided content)
The excerpt does not provide full model architecture/tokenizer/context-window specs; this limits reproducibility assessment strictly from the provided text.

7. Implications and Future Directions¶

How this changes the landscape (within RLVR group-relative training)
The paper reframes a common design choice—“use group mean reward as baseline”—as not merely a variance-reduction trick but as a source of systematic bias once you focus on the samples that produce learning.
This suggests that improving RLVR stability/performance may require not only better clipping/ratios (as many GRPO variants do) but also explicit bias correction in advantage estimation.
Follow-up research directions suggested by the work
Extend the bias-correction idea beyond group-relative RL to other advantage/credit-assignment settings (explicitly mentioned in the paper’s Limitations/Future work statement).
More principled or automated selection of \(\lambda_{\text{scale}}\) using the theoretical intervals (Eq. (28)) rather than empirical sweeps.
Explore stronger difficulty estimators than \(\hat p_t\) under small rollout budgets, potentially combining with adaptive sampling (the paper references sampling-related works, but does not implement them here).
Practical applications / downstream use cases
Any RLVR post-training pipeline for reasoning tasks where:
- rewards are binary or bounded,
- rollouts per prompt are small due to compute cost,
- group-relative methods are used to avoid a critic.
The benchmarks used are mathematical reasoning datasets (MATH500, AIME25, AMC23, Minerva, OlympiadBench), and HA-DW improves accuracy across them (Table 1).
Repro/Integration Guidance
When to prefer HA-DW (based on this paper’s evidence):
- If you already use GRPO, GSPO, or DAPO with small rollouts (e.g., \(G=8\)), HA-DW is designed as a plug-in multiplier on the advantage term (Eq. (33), (37), (39)).
- If you are considering increasing rollouts to reduce estimation error, Table 3 suggests HA-DW can yield larger gains than doubling rollouts from 8 to 16 (in their Qwen3-4B experiment) while avoiding memory issues.
Key knobs to tune (as evidenced here):
- \(\lambda_{\text{scale}}\): Table 7 indicates a middle range (e.g., 1.3–1.5 in their setup) performs best.
- History anchor behavior: Table 2 suggests dynamic \(C_t\) outperforms fixed difficulty thresholds.
Implementation footprint:
- Maintain \(C_t\) (Eq. (13)–(17)) and compute \(\Phi_{t,i}\) (Eq. (19)–(21)) per sample/group, then multiply existing advantages by \(\Phi_{t,i}\) before applying the standard GRPO/GSPO/DAPO update.