GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization¶

🎯 Pitch¶

The paper introduces GDPO, a simple policy-optimization modification that decouples group-wise normalization across individual reward components and then applies batch-wise advantage normalization—preserving fine-grained multi-reward distinctions that GRPO’s summed-normalization collapses. By restoring signal resolution, GDPO yields substantially more stable training and better trade-offs between correctness and constraint adherence across tool-calling, math, and coding RL tasks, making multi-objective alignment of language models more reliable and effective.

1. Executive Summary (2-3 sentences)¶

GDPO is a multi-reward reinforcement learning (RL) policy-optimization method that fixes a failure mode in applying GRPO to multiple heterogeneous rewards. The core idea is to normalize each reward component separately within each prompt’s rollout group, then aggregate, and finally apply a batch-wise advantage normalization to keep update magnitudes stable as the number of rewards grows. Across tool-calling, math reasoning, and coding reasoning tasks, this yields more stable training and better trade-offs between correctness and constraint adherence than summing rewards and running standard GRPO.

2. Context and Motivation¶

Problem / gap addressed
RL alignment pipelines for language models increasingly optimize multiple rewards (e.g., correctness + formatting + length + safety constraints).
Recent work often defaults to GRPO by summing reward components and then applying group-relative normalization, without checking whether that normalization preserves multi-reward signal resolution.
Why it matters
Multi-reward settings require the optimizer to distinguish how different objectives are satisfied (e.g., “correct and well-formatted” vs. “only correct”).
If the advantage signal loses these distinctions, policy updates can become inaccurate, hurting convergence and sometimes causing training collapse (the paper highlights instability in later training for GRPO in math/length experiments; see Figure 5 discussion).
Prior approach and its shortcoming
Standard multi-reward practice: compute a scalar reward as a weighted sum (here, often an unweighted sum) and run GRPO:
- Sum rewards (Eq. (1)), then group-normalize the sum to compute advantage (Eq. (2)), then run a PPO-style clipped objective (Eq. (3)).
Key shortcoming identified: group-wise normalization over the sum can compress distinct reward combinations into identical advantage values (“reward signal collapse”), reducing the “resolution” of learning signals (Section 2, Figure 2).
How this paper positions itself
Instead of proposing new reward functions, it asks a more basic question: is GRPO suitable for multi-reward optimization as commonly used?
It proposes GDPO as a minimal modification to GRPO that is specifically tailored to multi-reward training by changing how normalization is done (Section 3.1, Figure 1a).

3. Technical Approach¶

3.1 Reader orientation (approachable technical breakdown)¶

The system is a policy optimization algorithm for RL fine-tuning language models using multiple reward signals.
It solves multi-objective alignment by changing the advantage normalization step so that each reward dimension preserves its own relative differences before rewards are combined.

3.2 Big-picture architecture (diagram in words)¶

(1) Rollout sampler: for each prompt/question q_i, sample a group of G responses {o_j} from the current/behavior policy π_{θ_old}.
(2) Reward evaluators: compute n reward components r_k^{(i,j)} for each rollout (e.g., correctness, format, length, bug-free).
(3) Advantage computation:
Baseline (GRPO): sum rewards → group-normalize summed reward.
Proposed (GDPO): group-normalize each reward component → sum normalized components → batch-normalize final advantages.
(4) Policy update: apply a clipped policy-gradient objective using token-level importance ratios (Eq. (3)), optionally with a KL term (paper notes it is omitted in the displayed equation).

3.3 Roadmap for the deep dive¶

Explain GRPO’s multi-reward formulation and where normalization enters (Eqs. (1)–(3)).
Show the reward collapse mechanism with the paper’s small discrete example (Section 2, Figure 2).
Present GDPO’s decoupled normalization and batch-wise normalization (Eqs. (4)–(6)).
Describe how reward priorities are handled via weights and conditioned rewards (Eqs. (7)–(8)).
Summarize the concrete training/evaluation setups and what changes in practice across tasks (Section 4, Tables 1–5, Tables 6–7).

3.4 Detailed, sentence-based technical breakdown¶

This is an algorithmic contribution with empirical evaluation: it modifies GRPO’s advantage estimation to better support multi-reward RL and validates the change across multiple tasks.

3.4.1 What `GRPO` does in the common multi-reward setup (baseline in this paper)¶

The training data contains prompts/questions q_i, and for each prompt the behavior policy π_{θ_old} samples a group of G rollouts/responses {o_j}_{j=1..G} (Section 2).
In a multi-reward setting with n objectives, the common approach builds a single scalar reward by summing reward components: $$ r_{\text{sum}}^{(i,j)} = r_1^{(i,j)} + \cdots + r_n^{(i,j)} \tag{1} $$
GRPO then computes a group-relative advantage by normalizing the summed reward within the group for the same prompt: $$ A_{\text{sum}}^{(i,j)} = \frac{ r_{\text{sum}}^{(i,j)} - \operatorname{mean}\{r_{\text{sum}}^{(i,1)},\dots,r_{\text{sum}}^{(i,G)}\} }{ \operatorname{std}\{r_{\text{sum}}^{(i,1)},\dots,r_{\text{sum}}^{(i,G)}\} } \tag{2} $$
The policy is updated with a PPO-style clipped objective that operates at the token level (Eq. (3)), using an importance ratio: $$ s_{i,t}(\theta)= \frac{\pi_\theta(o^j_t \mid q, o^j_{<t})}{\pi_{\theta_\text{old}}(o^j_t \mid q, o^j_{<t})} $$ and then optimizing the clipped surrogate (the paper omits the KL term “for clarity”): $$ \mathcal{J}_{\text{GRPO}}(\theta)= \mathbb{E}\left[ \frac{1}{G}\sum_{j=1}^{G} \frac{1}{|o_j|}\sum_{t=1}^{|o_j|} \min\Big(s_{i,t}(\theta)A_{\text{sum}}^{(i,j)}, \ \operatorname{clip}(s_{i,t}(\theta),1-\epsilon,1+\epsilon)\,A_{\text{sum}}^{(i,j)}\Big) \right] \tag{3} $$ where ε is the clipping threshold.
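The baseline in Eqs. (1)–(3) can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the sample standard deviation is an assumption (it matches the 0.7071 values in the paper's G=2 example), the zero-variance fallback is an assumption, and the KL term is omitted as in Eq. (3).

```python
import statistics

def grpo_advantages(group_rewards):
    """group_rewards[j] is the list of reward components for rollout j of one prompt."""
    sums = [sum(components) for components in group_rewards]   # Eq. (1)
    mean = statistics.mean(sums)
    std = statistics.stdev(sums)   # sample std (assumption); needs G >= 2
    if std == 0:
        # Assumption: a constant group carries no signal, so return zeros.
        return [0.0 for _ in sums]
    return [(s - mean) / std for s in sums]                     # Eq. (2)

def clipped_term(ratio, advantage, eps=0.2):
    """One token's contribution to the clipped surrogate in Eq. (3)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def grpo_objective(ratios, advantages, eps=0.2):
    """ratios[j][t] is the importance ratio for token t of rollout j.

    Averages over tokens within each rollout, then over rollouts in the group.
    """
    per_rollout = []
    for token_ratios, adv in zip(ratios, advantages):
        terms = [clipped_term(s, adv, eps) for s in token_ratios]
        per_rollout.append(sum(terms) / len(terms))
    return sum(per_rollout) / len(per_rollout)
```

Every token in a rollout shares that rollout's scalar advantage, so any information lost when the advantage is computed cannot be recovered by the token-level objective.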

3.4.2 Why summing first + group-normalizing can collapse the training signal¶

The paper’s key observation is that different combinations of component rewards can map to the same normalized advantage after Eq. (2), especially when rewards are discrete/coarse or the group size is small (Section 2, Figure 2).
Worked micro-example (the one used by the paper, Section 2 / Figure 2):
Setting: G=2 rollouts per prompt, and n=2 binary rewards r_1, r_2 ∈ {0,1}.
Each rollout’s total reward is in {0,1,2}.
Consider a group where rollout totals are (0,1):
- Mean is 0.5 and the sample standard deviation is 0.7071, so normalized advantages are approximately (-0.7071, +0.7071).
Now consider totals (0,2):
- Mean is 1 and the sample standard deviation is 1.4142, so normalized advantages are also (-0.7071, +0.7071).
In other words, a group where one rollout satisfies both rewards (total 2) produces the same advantage pattern as a group where it satisfies only one reward (total 1), losing a distinction that should matter for learning signal strength (Section 2 explanation).
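The micro-example above can be reproduced in a few lines of Python. This is only a sketch: `group_normalize` is a hypothetical helper, and it assumes the sample standard deviation, which is what yields the 0.7071 values quoted above.

```python
import statistics

def group_normalize(values):
    """Standardize values within one rollout group (sample std, as assumed here)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Group A: the better rollout satisfies one reward (totals 0 and 1).
adv_one_reward = group_normalize([0, 1])
# Group B: the better rollout satisfies both rewards (totals 0 and 2).
adv_both_rewards = group_normalize([0, 2])

print([round(a, 4) for a in adv_one_reward])    # [-0.7071, 0.7071]
print([round(a, 4) for a in adv_both_rewards])  # [-0.7071, 0.7071]
```

With G=2, any two distinct totals standardize to the same pair, so the size of the gap between the rollouts never reaches the policy update.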
The paper generalizes this “collapse” intuition by counting the number of distinct advantage groups that can arise under different numbers of rollouts/rewards, showing GDPO preserves substantially more distinct advantage values than both GRPO and a GRPO variant without standard deviation normalization (Figure 3).
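The counting idea can be illustrated with a brute-force enumeration. The sketch below is an assumption-laden stand-in for the paper's analysis, not its code: it considers only binary rewards, treats a group as unordered, rounds advantages to six decimals before comparing, uses the sample standard deviation, and counts GDPO advantages before the batch-wise step (which only shifts and rescales values across the batch). All function names are hypothetical.

```python
import itertools
import statistics

def group_normalize(values):
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def grpo_group(group):
    """Sum rewards per rollout, then normalize the sums within the group."""
    return group_normalize([sum(rewards) for rewards in group])

def gdpo_group(group):
    """Normalize each reward within the group, then sum the normalized values."""
    n_rewards = len(group[0])
    per_reward = [group_normalize([rewards[k] for rewards in group])
                  for k in range(n_rewards)]
    return [sum(column[j] for column in per_reward) for j in range(len(group))]

def count_distinct_groups(advantage_fn, group_size, n_rewards):
    """Count distinct (unordered) advantage patterns over all binary reward groups."""
    reward_vectors = list(itertools.product([0, 1], repeat=n_rewards))
    seen = set()
    for group in itertools.product(reward_vectors, repeat=group_size):
        advantages = advantage_fn(group)
        seen.add(tuple(sorted(round(a, 6) for a in advantages)))
    return len(seen)

print(count_distinct_groups(grpo_group, 2, 2))  # 2
print(count_distinct_groups(gdpo_group, 2, 2))  # 3
```

For G=2 and two binary rewards, the summed variant yields only "tie" and "one rollout better", while the decoupled variant also separates "better on one reward" from "better on both".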

3.4.3 Why “GRPO without std normalization” is not enough¶

The paper discusses a variant used in some recent works where the standard deviation term is removed, so the advantage is only mean-centered within the group: $$ A_{\text{sum}}^{(i,j)} = r_{\text{sum}}^{(i,j)} - \operatorname{mean}\{r_{\text{sum}}^{(i,1)},\dots,r_{\text{sum}}^{(i,G)}\} $$ (described in Section 2, where the equation is not numbered).
This can separate some cases (e.g., (0,1) vs (0,2) become different), but the paper reports:
Only modest improvement in distinct advantage diversity when scaling rollouts/rewards (Figure 3).
Empirically, it does not improve convergence and can be unstable: in tool-calling, the “GRPO w/o std” variant achieves 0% correct format on BFCL-v3 despite reasonable correctness reward curves (Section 4.1.1, Table 2, Figure 4).
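A small sketch of the mean-centered variant shows both what it recovers and one thing it cannot recover. The helper name is hypothetical, and the second example is an illustration of a limit shared by any sum-based scheme rather than a result reported in the paper.

```python
def mean_center(values):
    """GRPO without std normalization: subtract the group mean only."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# The two groups from the micro-example are now distinguishable.
print(mean_center([0, 1]))  # [-0.5, 0.5]
print(mean_center([0, 2]))  # [-1.0, 1.0]

# Still sum-based: rollouts that satisfy *different* rewards look identical.
# Rollout 0 earns only reward 1, rollout 1 earns only reward 2.
group = [(1, 0), (0, 1)]
sums = [sum(rewards) for rewards in group]
print(mean_center(sums))    # [0.0, 0.0]
```

Dropping the standard deviation restores the gap size, but the advantage scale then depends directly on the raw reward ranges, which is one plausible source of the instability described above.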

3.4.4 `GDPO`: decouple normalization per reward, then stabilize scale with batch normalization¶

GDPO changes only the advantage computation, leaving the overall policy-gradient update form similar in spirit to GRPO.
Step 1: for each reward component k ∈ {1..n}, compute a per-reward group-normalized advantage within the rollout group for the same prompt: $$ A_k^{(i,j)} = \frac{ r_k^{(i,j)} - \operatorname{mean}\{r_k^{(i,1)},\dots,r_k^{(i,G)}\} }{ \operatorname{std}\{r_k^{(i,1)},\dots,r_k^{(i,G)}\} } \tag{4} $$
Step 2: aggregate across objectives by summing the normalized components: $$ A_{\text{sum}}^{(i,j)} = A_1^{(i,j)} + \cdots + A_n^{(i,j)} \tag{5} $$
Step 3: apply batch-wise normalization over all rollouts in the current training batch to keep the advantage scale stable as n grows: $$ \hat{A}_{\text{sum}}^{(i,j)}= \frac{ A_{\text{sum}}^{(i,j)} - \operatorname{mean}\{A_{\text{sum}}^{(i',j')}\mid i'\in D_{\text{Batch}},\ j'=1..G\} }{ \operatorname{std}\{A_{\text{sum}}^{(i',j')}\mid i'\in D_{\text{Batch}},\ j'=1..G\}+\epsilon } \tag{6} $$
Here ε is an additive stabilizer in the denominator (distinct from the PPO clip threshold ε in Eq. (3); the paper uses the same symbol, so this interpretation is necessary to avoid confusion).
Mechanistic intuition (how it fixes the collapse shown in Figure 2):
If you normalize only the sum, you can’t tell whether a high sum came from one reward being high vs. multiple rewards being jointly satisfied, once everything is standardized within a group.
If you normalize each reward separately, then differences along each reward axis remain visible in the aggregated advantage (Section 3.1, Figure 2 discussion).
Stability note:
The paper states batch-wise normalization improves stability and shows that removing it can cause occasional convergence failures (Appendix A, Figure 8).
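The three steps in Eqs. (4)–(6) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the function names and nested-list layout are hypothetical, the sample standard deviation is assumed at both the group and batch level, and the zero-variance fallback in Step 1 is an assumption.

```python
import statistics

def group_normalize(values):
    """Eq. (4) for one reward component within one rollout group."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)   # sample std (assumption)
    if std == 0:
        return [0.0 for _ in values]  # assumption: constant reward gives no signal
    return [(v - mean) / std for v in values]

def decoupled_sum(group):
    """Steps 1 and 2: group[j][k] is reward k of rollout j for one prompt."""
    n_rewards = len(group[0])
    per_reward = [group_normalize([rollout[k] for rollout in group])
                  for k in range(n_rewards)]                      # Eq. (4)
    return [sum(column[j] for column in per_reward)
            for j in range(len(group))]                           # Eq. (5)

def gdpo_advantages(batch, eps=1e-6):
    """Step 3: batch[i] is the rollout group for prompt i."""
    summed = [decoupled_sum(group) for group in batch]
    flat = [a for group in summed for a in group]
    mean = statistics.mean(flat)
    std = statistics.stdev(flat)
    return [[(a - mean) / (std + eps) for a in group] for group in summed]  # Eq. (6)

# Two prompts, G=2, two binary rewards.
batch = [
    [[0, 0], [1, 0]],   # better rollout satisfies one reward
    [[0, 0], [1, 1]],   # better rollout satisfies both rewards
]
for group_advantages in gdpo_advantages(batch):
    print([round(a, 3) for a in group_advantages])
```

In this toy batch the second group's advantages are twice as large in magnitude as the first group's, which is the distinction the summed normalization loses.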

3.4.5 Handling different objective priorities: weights vs conditioned rewards¶

The paper treats this as part of “effective incorporation of priority variation” (Section 3.2).

Weighted objectives
Standard approach: choose weights w_k to reflect priority and form a weighted sum. In GDPO, weights apply to the normalized per-reward advantages: $$ A_{\text{sum}}^{(i,j)} = w_1 A_1^{(i,j)} + \cdots + w_n A_n^{(i,j)} \tag{7} $$
Practical issue reported: if one reward is much easier, the model may optimize it “regardless of assigned weights” unless weights become extremely imbalanced (Section 3.2; examined in Section 4.2.1).
Conditioned rewards (gating an easy reward on a hard reward)
To prevent an easy objective from dominating, the paper describes conditioning reward r_k on another reward r_l meeting a threshold t: $$ r_k= \begin{cases} r_k,& \text{if } r_l \ge t\\ 0,& \text{otherwise} \end{cases} \tag{8} $$
Instantiation in math reasoning (Section 4.2.1): replace the length reward with a conditioned length reward that only pays out if the answer is correct:
- \tilde{R}_{length} = 1 iff length ≤ l and R_correct = 1, else 0.
The paper claims this makes weight tuning behave more predictably afterward (Figure 6 discussion; Tables 8–9).
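Both priority mechanisms are small enough to sketch directly. The function names below are hypothetical; the default `max_length=4000` follows the length target l reported for the math experiments.

```python
def weighted_advantage(per_reward_advantages, weights):
    """Eq. (7): weights apply to the normalized per-reward advantages."""
    return sum(w * a for w, a in zip(weights, per_reward_advantages))

def condition_reward(r_k, r_l, threshold):
    """Eq. (8): reward k pays out only if reward l reaches the threshold."""
    return r_k if r_l >= threshold else 0.0

def conditioned_length_reward(response_length, is_correct, max_length=4000):
    """Math instantiation: 1 iff the response is short enough AND correct."""
    return 1 if (response_length <= max_length and is_correct == 1) else 0

# A short but wrong answer no longer earns the length reward.
print(conditioned_length_reward(1200, is_correct=0))  # 0
print(conditioned_length_reward(1200, is_correct=1))  # 1
print(conditioned_length_reward(5000, is_correct=1))  # 0
```

Weighting rescales a signal the model can already collect cheaply, whereas conditioning removes the cheap path entirely; that difference is consistent with the paper's report that weights behave more predictably once conditioning is in place.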

3.4.6 System/data pipeline “what happens first, second, third” (end-to-end training loop)¶

Putting the algorithm into an operational pipeline consistent with the paper’s setups:

Sample a minibatch of prompts from the RL training dataset (e.g., tool-calling mixture totaling 4k prompts; math dataset 40k prompts; coding dataset 24k prompts).
For each prompt, sample G rollouts from the current policy (e.g., G=4 in tool calling; G=16 in math/coding).
Compute multiple reward components for each rollout using task-specific reward functions:
Tool calling: R_format ∈ {0,1} and R_correct ∈ [-3,3] (Section 4.1; Appendix C).
Math: R_length ∈ {0,1} and R_correct ∈ {0,1} (Section 4.2).
Coding: R_pass ∈ [0,1], conditioned \tilde{R}_{length} ∈ {0,1}, and R_bug ∈ {0,1} (Section 4.3).
Compute advantages:
Baseline: sum rewards then group-normalize (GRPO, Eq. (2)).
Proposed: group-normalize per reward (Eq. (4)), sum (Eq. (5)), batch-normalize (Eq. (6)).
Update the policy using the clipped objective over tokens (Eq. (3)) under the training framework (verl is used throughout the experiments described).
Evaluate periodically on downstream benchmarks (BFCL-v3 for tool calling; AIME/AMC/MATH/Minerva/OlympiadBench for math; PRIME validation tasks for coding) with specified decoding settings.
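The reward step of the pipeline above can be sketched for the math and coding tasks. This is an illustration only: the function names, argument names, and dictionary keys are hypothetical, and the coding length limit is left as a required argument because its value is not stated in this summary.

```python
def math_rewards(is_correct, response_length, max_length=4000):
    """Math task: two independent binary rewards (length is NOT conditioned here)."""
    return {
        "correct": 1 if is_correct else 0,
        "length": 1 if response_length <= max_length else 0,
    }

def coding_rewards(tests_passed, tests_total, response_length, had_error, max_length):
    """Coding task: pass fraction, conditioned length reward, and bug-free reward."""
    r_pass = tests_passed / tests_total
    return {
        "pass": r_pass,
        # Pays out only if the response is short enough AND all tests pass.
        "length_cond": 1 if (response_length <= max_length and r_pass == 1.0) else 0,
        # 1 when there was no runtime or compilation error.
        "bug": 0 if had_error else 1,
    }

print(math_rewards(is_correct=True, response_length=5200))
print(coding_rewards(tests_passed=3, tests_total=4, response_length=900,
                     had_error=False, max_length=4000))
```

Keeping the components as separate entries, rather than summing them immediately, is what allows the per-reward normalization in Eq. (4) to be applied afterwards.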

3.4.7 Core configurations / hyperparameters reported in the paper (and what is missing)¶

The paper provides partial training hyperparameters (Tables 6–7), plus rollout and evaluation settings in Section 4. Key reported items:

Tool calling training (Section 4.1; Table 6)
Models: Qwen2.5-Instruct-1.5B and Qwen2.5-Instruct-3B.
Steps: “100 steps” in the narrative (Section 4.1).
Rollouts per prompt: G=4 (Table 6: actor_rollout_ref.rollout.n = 4; the config key n here is the group size, not the number of rewards).
Batch size: 512 (Table 6).
Max response length: 1024 tokens (Section 4.1).
Max prompt length: 2048 (Table 6).
Learning rate: 1e-6 (Table 6).
KL coefficient: 0.001 (Table 6).
Total epochs: 15 (Table 6).
PPO minibatch size: 128 (Table 6).
Missing from provided content: optimizer type (e.g., AdamW), beta/epsilon values, LR schedule, model architecture details (layers/hidden size/heads), tokenizer/context window beyond max prompt/response limits, hardware/compute budget.
Math and coding training (Section 4.2 / 4.3; Table 7)
Models: DeepSeek-R1-1.5B, DeepSeek-R1-7B, Qwen3-4B-Instruct.
Math steps: 500 steps (Section 4.2).
Coding steps: 400 steps (Section 4.3).
Rollouts per prompt: 16 (Table 7: actor_rollout_ref.rollout.n = 16).
Batch size: 512 (Table 7).
Max prompt length: 1024 (Table 7).
Learning rate: 1e-6 (Table 7).
Rollout temperature during training: 1 (Table 7).
PPO epochs: 1 (Table 7).
PPO minibatch size: 64 (Table 7).
Clip ratios: low 0.2, high 0.28 (Table 7).
KL loss: coefficient 0.0005, type mse (Table 7).
Group filtering enabled with metric seq_reward (Table 7: algorithm.filter_groups.enable TRUE).
Length target: l = 4000 tokens (Section 4.2).
Missing from provided content: same as above—optimizer details, schedules, compute/hardware, and model architectural hyperparameters.
Inference/evaluation decoding (Section 4.2 and 4.3)
Backend: vLLM.
Temperature: 0.6.
top_p = 0.95.
Max response length: 32k tokens.
Samples per question: 16 for evaluation.

4. Key Insights and Innovations¶

(1) Identification of multi-reward “advantage collapse” in naive GRPO
Novelty: the paper isolates a concrete mechanism where group-normalizing an aggregated reward sum maps many distinct reward combinations to identical normalized advantages (Section 2, Figure 2).
Significance: this is not just lower performance; it can reduce training signal expressiveness and contribute to instability (the paper connects GRPO to declining correctness after ~400 steps in math training; Figure 5 narrative).
(2) GDPO: group-wise decoupled normalization per reward
Novelty: compute per-reward group-normalized advantages (Eq. (4)) before combining objectives.
Significance: increases the number of distinct advantage values available to learning as rollouts/rewards scale (Figure 3), improving multi-reward optimization fidelity.
(3) Batch-wise normalization of aggregated multi-reward advantage
Novelty (relative to just “normalize per reward”): the explicit additional normalization step (Eq. (6)) intended to keep advantage scale independent of the number of rewards.
Significance: empirically linked to training stability; removing it can cause occasional convergence failures (Appendix A, Figure 8).
(4) Practical guidance on priority handling: weighting vs conditioning
The paper argues reward weights alone may not reflect intended priorities when objective difficulties differ (Section 3.2, Figure 6 discussion).
Conditioning an easy reward on a harder one (Eq. (8)) is presented as a more forceful mechanism to enforce priority ordering (Section 4.2.1, Figure 7, Table 4).

5. Experimental Analysis¶

Evaluation methodology (tasks, datasets, metrics, baselines)¶

Baselines compared
GRPO with summed reward (standard baseline throughout).
GRPO w/o std (tool calling ablation; Section 4.1.1).
GDPO (proposed).
Task 1: Tool calling (Section 4.1)
Training data: 4k prompts total: 2k ToolACE, 1k Hammer, 1k xLAM (Section 4.1).
Rewards:
- R_format ∈ {0,1} checks tag/order structure (Appendix C, Eq. (9)).
- R_correct ∈ [-3,3] based on tool name/parameter matching against ground truth (Appendix C).
Training setup: 4 rollouts, batch size 512, max response length 1024, trained for 100 steps, 5 runs (Section 4.1; Table 6).
Evaluation benchmark: BFCL-v3, metrics include “Avg Acc” and “Correct Format” plus subcategories (Table 1).
Task 2: Math reasoning (Section 4.2)
Training data: DeepScaleR-Preview dataset with 40k competition-level problems (Section 4.2).
Rewards:
- R_length ∈ {0,1} for staying within l=4000 tokens.
- R_correct ∈ {0,1} for final answer match.
Training setup: 16 rollouts, batch size 512, max response length 8000, 500 steps (Section 4.2; Table 7).
Evaluation benchmarks: AIME-24, AMC 2022/2023, MATH, Minerva, OlympiadBench, reporting pass@1 and “Exceed” (% outputs exceeding 4000 tokens) (Section 4.2; Table 3).
Task 3: Coding reasoning (Section 4.3)
Training data: Eurus-2-RL with 24k coding problems (Section 4.3).
Rewards:
- R_pass ∈ [0,1] = fraction of test cases passed.
- Conditioned \tilde{R}_{length} pays out only if length ≤ l and R_pass = 1.
- R_bug ∈ {0,1} for no runtime/compilation error.
Training setup: 400 steps, “same hyperparameter configuration used in the mathematical reasoning experiments” (Section 4.3; Table 7 provides that configuration).
Evaluation: PRIME validation tasks (Apps, CodeContests, Codeforces, Taco), metrics: Pass, Exceed, Bug (Table 5).
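The accuracy and "Exceed" metrics for the math benchmarks can be sketched as simple averages over sampled responses. This assumes pass@1 is estimated as the mean correctness over the samples drawn per question (16 in the paper's evaluation) and then averaged over questions; the paper's exact aggregation is not spelled out in this summary, and the function names are hypothetical.

```python
def pass_at_1(per_question_correct):
    """per_question_correct[q] is a list of 0/1 flags, one per sampled response."""
    per_question = [sum(flags) / len(flags) for flags in per_question_correct]
    return sum(per_question) / len(per_question)

def exceed_rate(per_question_lengths, max_length=4000):
    """Fraction of all sampled responses longer than max_length tokens."""
    lengths = [n for question in per_question_lengths for n in question]
    return sum(1 for n in lengths if n > max_length) / len(lengths)

# Toy data: 2 questions, 4 samples each (the paper samples 16 per question).
correct = [[1, 1, 0, 1], [0, 0, 1, 0]]
lengths = [[1200, 3900, 4100, 800], [5000, 2500, 3999, 4000]]
print(pass_at_1(correct))    # 0.5
print(exceed_rate(lengths))  # 0.25
```

Reporting both numbers side by side is what makes the correctness versus constraint trade-off visible, since either metric can be improved at the expense of the other.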

Main quantitative results (with specific numbers)¶

Tool calling: GDPO > GRPO on both accuracy and formatting (Table 1)
Qwen2.5-Instruct-1.5B:
- Avg Acc: GRPO 30.18% → GDPO 32.81%
- Correct Format: GRPO 76.33% → GDPO 80.66%
Qwen2.5-Instruct-3B:
- Avg Acc: GRPO 39.20% → GDPO 40.87%
- Correct Format: GRPO 81.64% → GDPO 82.23%
Training curves: GDPO converges to higher median reward values for both correctness and format (Figure 4; also Figure 1b for 1.5B tool-calling).
Tool calling ablation: “GRPO w/o std” is unstable for formatting (Section 4.1.1, Table 2, Figure 4)
Qwen2.5-Instruct-1.5B:
- Avg Acc: GRPO 30.18%, GRPO w/o std 29.26%, GDPO 32.81%
- Correct Format: GRPO 76.33%, GRPO w/o std 0%, GDPO 80.66%
Math reasoning: GDPO tends to improve accuracy while dramatically reducing length violations (Table 3)
DeepSeek-R1-1.5B:
- AIME Acc: GRPO 23.1% → GDPO 29.4% (a gain of 6.3 percentage points, as also stated in the intro)
- AIME Exceed: GRPO 10.8% → GDPO 6.5%
- MATH Acc: GRPO 83.6% → GDPO 86.2%
- MATH Exceed: GRPO 1.5% → GDPO 0.8%
DeepSeek-R1-7B:
- AIME Acc: GRPO 50.2% → GDPO 53.1%
- AIME Exceed: GRPO 2.1% → GDPO 0.2%
Qwen3-4B-Instruct:
- AIME Acc: GRPO 54.6% → GDPO 56.9% (a gain of 2.3 percentage points, as stated in the intro)
- AIME Exceed: GRPO 2.5% → GDPO 0.1%
Training dynamics: both methods quickly maximize the easy length reward early, but GDPO recovers correctness and remains stable, while GRPO correctness declines after ~400 steps and max response length rises (Figure 5).
Math priority experiments: weights alone behave inconsistently; conditioning changes behavior (Section 4.2.1)
Figure 6 shows accuracy/exceed trends under varying length reward weights with and without conditioning; the text highlights that decreasing w_length from 0.75 to 0.5 often barely changes exceed rates, implying weights don’t reliably impose priority when difficulties differ.
With conditioned length reward, the paper reports more predictable monotonic changes when adjusting \tilde{w}_{length} (Figure 6, plus Tables 8–9 in Appendix G).
Coding reasoning: GDPO improves multi-objective trade-offs in both 2-reward and 3-reward settings (Table 5)
Two-objective (Pass + conditioned length), DeepSeek-R1-7B:
- CodeContests Pass: GRPO (2-obj) 63.2% → GDPO (2-obj) 65.8%
- Taco Pass: GRPO (2-obj) 45.1% → GDPO (2-obj) 48.4%
Three-objective (Pass + conditioned length + Bug):
- Codeforces Bug: GRPO (3-obj) 2.5% → GDPO (3-obj) 1.8%
- Apps Exceed: GRPO (3-obj) 11.2% → GDPO (3-obj) 8.5%
- Pass is similar between GRPO (3-obj) and GDPO (3-obj) across tasks, while GDPO (3-obj) often reduces Exceed and Bug.

Do the experiments support the claims?¶

The tool-calling and math experiments provide direct evidence that GDPO improves convergence and final metrics over GRPO under the paper’s settings (Tables 1 and 3; Figures 4 and 5).
The “GRPO w/o std” ablation strengthens the argument that simply tweaking normalization is not sufficient and can introduce instability (Table 2, Figure 4).
The coding experiments indicate the method continues to work when scaling from 2 to 3 rewards (Table 5), aligning with the paper’s generalization claim.

Ablations, failure cases, robustness checks¶

Ablation: GRPO w/o std (Section 4.1.1) and GDPO w/o BN (Appendix A).
Failure/stability: GDPO without batch-wise normalization sometimes fails to converge (Appendix A, Figure 8).
The paper reports multiple-run statistics for tool-calling (5 runs, median + IQR bands in Figures 1b and 4), which partially addresses variance.

6. Limitations and Trade-offs¶

Incomplete reporting of training stack details
The provided content includes key RL hyperparameters (batch size, rollouts, clip ratios, LR, KL coefficients) but does not specify optimizer type, LR schedule, or hardware/compute budget (Tables 6–7).
This makes exact reproduction harder without consulting the referenced configurations/appendices beyond what is shown here.
Dependence on non-degenerate per-reward variance
Per-reward group normalization (Eq. (4)) divides by the reward’s standard deviation within a rollout group. If a reward is constant across the group for a prompt, that standard deviation is zero, and Eq. (4) as shown includes no stabilizer (whereas Eq. (6) explicitly adds ε to the denominator).
In practice, an implementation must guard against zero-variance rewards within a group; the material summarized here does not say how the paper handles this case.
Conditioning can change the feasible objective set
Conditioning an easy reward on a hard reward (Eq. (8)) enforces priority but can also withhold learning signal for the easy objective until the hard objective is met, which may slow learning of that objective in early phases (this is a conceptual trade-off; the paper emphasizes the priority benefit in Section 4.2.1).
Scope of evaluation
The experiments cover three application types (tools, math, coding), but all are within the LLM RL fine-tuning regime described. The paper does not evaluate, in the provided content, settings like continuous-control RL, non-language policies, or reward models with very different statistical properties.
Additional normalization layers
GDPO introduces more normalization steps (per reward + batch-wise). While empirically helpful here, extra normalization can sometimes interact with reward scaling/weighting in non-obvious ways; the paper partially addresses this with the batch-wise normalization motivation and ablation (Eq. (6), Appendix A).
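For the zero-variance limitation noted above, one common guard is to add a small constant to the denominator of Eq. (4), mirroring Eq. (6). The sketch below is an assumption about how an implementation might do this, not the paper's stated method; the function name and the `eps` default are hypothetical.

```python
import statistics

def safe_group_normalize(values, eps=1e-6):
    """Per-reward group normalization with a stabilized denominator.

    A constant group yields all-zero advantages instead of a division by zero.
    """
    if len(values) < 2:
        # A single rollout has no group-relative signal (and no sample std).
        return [0.0 for _ in values]
    mean = statistics.mean(values)
    std = statistics.stdev(values)   # sample std (assumption)
    return [(v - mean) / (std + eps) for v in values]

# Every rollout already satisfies the format reward: no gradient from it.
print(safe_group_normalize([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
# A mixed group still produces a usable signal.
print([round(a, 3) for a in safe_group_normalize([0, 1, 1, 1])])
```

The trade-off is that a reward with nearly constant values in a group gets amplified up to roughly 1/eps, so the choice of `eps` interacts with the batch-wise normalization that follows.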

7. Implications and Future Directions¶

How this changes practice for multi-reward RLHF-style training
A main implication is that “sum rewards then run GRPO” can be a poor default because it can erase distinctions between different multi-objective outcomes (Section 2, Figure 2).
GDPO offers a simple drop-in replacement at the advantage computation step that better preserves per-objective information and improves stability (Section 3.1, Figures 4–5, Tables 1–5).
Follow-up research suggested by the paper’s findings
More principled handling of objective difficulty imbalance, beyond ad hoc weight changes, is motivated by the observation that easy rewards dominate early training (Figure 5 discussion; Section 3.2).
Further exploration of reward-structure design (e.g., conditioning strategies, thresholds t in Eq. (8)) could be systematized for different domains and reward distributions.
Practical applications / downstream use cases
Any RL fine-tuning scenario where you must jointly optimize correctness + constraints, including:
- Tool calling where outputs must be both correct and correctly structured (Section 4.1).
- Reasoning models where you want accuracy but also bounded output length (Section 4.2).
- Code generation where you want pass rate, brevity, and fewer runtime/compile failures (Section 4.3).
Repro/Integration Guidance: when to prefer GDPO over alternatives (based on this paper)
Prefer GDPO over naive multi-reward GRPO when:
- You have multiple heterogeneous rewards (different ranges/discreteness), and especially when rewards are coarse (binary/thresholded), because collapse is more likely (Section 2, Figure 2).
- You observe training instability or late-stage degradation under GRPO in multi-reward setups (Figure 5 shows such a pattern in math reasoning).
- You scale to more objectives (3-reward coding setting shows benefits; Table 5).
Use batch-wise advantage normalization (Eq. (6)) as part of GDPO:
- The paper’s ablation indicates removing it can cause occasional failure (Appendix A, Figure 8).
For preference priority:
- Start with weights (Eq. (7)) but do not expect reliable priority enforcement when objective difficulties differ (Section 4.2.1, Figure 6).
- Consider conditioning (Eq. (8)) to enforce “must be correct before getting length/format reward,” as demonstrated for length control (Section 4.2.1, Figure 7, Table 4).