GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
ArXiv: 2601.05242
🎯 Pitch
The paper introduces GDPO, a simple policy-optimization modification that decouples group-wise normalization across individual reward components and then applies batch-wise advantage normalization—preserving fine-grained multi-reward distinctions that GRPO’s summed-normalization collapses. By restoring signal resolution, GDPO yields substantially more stable training and better trade-offs between correctness and constraint adherence across tool-calling, math, and coding RL tasks, making multi-objective alignment of language models more reliable and effective.
1. Executive Summary
GDPO is a multi-reward reinforcement learning (RL) policy-optimization method that fixes a failure mode in applying GRPO to multiple heterogeneous rewards. The core idea is to normalize each reward component separately within each prompt’s rollout group, then aggregate, and finally apply a batch-wise advantage normalization to keep update magnitudes stable as the number of rewards grows. Across tool-calling, math reasoning, and coding reasoning tasks, this yields more stable training and better trade-offs between correctness and constraint adherence than summing rewards and running standard GRPO.
2. Context and Motivation
- Problem / gap addressed
  - RL alignment pipelines for language models increasingly optimize multiple rewards (e.g., correctness + formatting + length + safety constraints).
  - Recent work often defaults to `GRPO` by summing reward components and then applying group-relative normalization, without checking whether that normalization preserves multi-reward signal resolution.
- Why it matters
  - Multi-reward settings require the optimizer to distinguish how different objectives are satisfied (e.g., “correct and well-formatted” vs. “only correct”).
  - If the advantage signal loses these distinctions, policy updates can become inaccurate, hurting convergence and sometimes causing training collapse (the paper highlights instability in later training for GRPO in math/length experiments; see Figure 5 discussion).
- Prior approach and its shortcoming
  - Standard multi-reward practice: compute a scalar reward as a weighted sum (here, often an unweighted sum) and run `GRPO`:
    - Sum rewards (Eq. (1)), then group-normalize the sum to compute advantage (Eq. (2)), then run a PPO-style clipped objective (Eq. (3)).
  - Key shortcoming identified: group-wise normalization over the sum can compress distinct reward combinations into identical advantage values (“reward signal collapse”), reducing the “resolution” of learning signals (Section 2, Figure 2).
- How this paper positions itself
  - Instead of proposing new reward functions, it asks a more basic question: is `GRPO` suitable for multi-reward optimization as commonly used?
  - It proposes `GDPO` as a minimal modification to `GRPO` that is specifically tailored to multi-reward training by changing how normalization is done (Section 3.1, Figure 1a).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a policy optimization algorithm for RL fine-tuning language models using multiple reward signals.
- It solves multi-objective alignment by changing the advantage normalization step so that each reward dimension preserves its own relative differences before rewards are combined.
3.2 Big-picture architecture (diagram in words)
- (1) Rollout sampler: for each prompt/question `q_i`, sample a group of `G` responses `{o_j}` from the current/behavior policy `π_{θ_old}`.
- (2) Reward evaluators: compute `n` reward components `r_k^{(i,j)}` for each rollout (e.g., correctness, format, length, bug-free).
- (3) Advantage computation:
  - Baseline (`GRPO`): sum rewards → group-normalize summed reward.
  - Proposed (`GDPO`): group-normalize each reward component → sum normalized components → batch-normalize final advantages.
- (4) Policy update: apply a clipped policy-gradient objective using token-level importance ratios (Eq. (3)), optionally with a KL term (the paper notes it is omitted in the displayed equation).
3.3 Roadmap for the deep dive
- Explain `GRPO`’s multi-reward formulation and where normalization enters (Eqs. (1)–(3)).
- Show the reward collapse mechanism with the paper’s small discrete example (Section 2, Figure 2).
- Present `GDPO`’s decoupled normalization and batch-wise normalization (Eqs. (4)–(6)).
- Describe how reward priorities are handled via weights and conditioned rewards (Eqs. (7)–(8)).
- Summarize the concrete training/evaluation setups and what changes in practice across tasks (Section 4, Tables 1–5, Tables 6–7).
3.4 Detailed, sentence-based technical breakdown
This is an algorithmic contribution with empirical evaluation: it modifies GRPO’s advantage estimation to better support multi-reward RL and validates the change across multiple tasks.
3.4.1 What GRPO does in the common multi-reward setup (baseline in this paper)
- The training data contains prompts/questions `q_i`, and for each prompt the behavior policy `π_{θ_old}` samples a group of `G` rollouts/responses `{o_j}_{j=1..G}` (Section 2).
- In a multi-reward setting with `n` objectives, the common approach builds a single scalar reward by summing reward components:
$$
r_{\text{sum}}^{(i,j)} = r_1^{(i,j)} + \cdots + r_n^{(i,j)} \tag{1}
$$
- `GRPO` then computes a group-relative advantage by normalizing the summed reward within the group for the same prompt:
$$
A_{\text{sum}}^{(i,j)} = \frac{ r_{\text{sum}}^{(i,j)} - \operatorname{mean}\{r_{\text{sum}}^{(i,1)},\dots,r_{\text{sum}}^{(i,G)}\} }{ \operatorname{std}\{r_{\text{sum}}^{(i,1)},\dots,r_{\text{sum}}^{(i,G)}\} } \tag{2}
$$
- The policy is updated with a PPO-style clipped objective that operates at the token level (Eq. (3)), using an importance ratio:
$$
s_{i,t}(\theta)=
\frac{\pi_\theta(o^j_t \mid q, o^j_{<t})}{\pi_{\theta_\text{old}}(o^j_t \mid q, o^j_{<t})}
$$
and then optimizing the clipped surrogate (the paper omits the KL term “for clarity”):
$$
\mathcal{J}_{\text{GRPO}}(\theta)=
\mathbb{E}\left[
\frac{1}{G}\sum_{j=1}^{G} \frac{1}{|o_j|}\sum_{t=1}^{|o_j|}
\min\Big(s_{i,t}(\theta)A_{\text{sum}}^{(i,j)}, \ \operatorname{clip}(s_{i,t}(\theta),1-\epsilon,1+\epsilon)\,A_{\text{sum}}^{(i,j)}\Big)
\right] \tag{3}
$$
where `ε` is the clipping threshold.
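As a rough illustration of Eqs. (1)–(3), the sketch below computes the summed-reward group advantage and the per-token clipped term in plain Python. The function names (`grpo_advantages`, `clipped_term`), the use of the sample standard deviation, the zero-variance fallback, and the default `eps=0.2` are assumptions made for illustration; they are not taken from the paper's code.

```python
import statistics


def grpo_advantages(group_rewards):
    """Eqs. (1)-(2): sum each rollout's reward components, then
    standardize the sums within the rollout group of one prompt.

    group_rewards: list of G rollouts, each a list of n reward values.
    """
    sums = [sum(r) for r in group_rewards]           # Eq. (1)
    mean = statistics.mean(sums)
    std = statistics.stdev(sums)                     # sample std (assumption)
    if std == 0:
        # Assumed fallback: identical sums carry no relative signal.
        return [0.0 for _ in sums]
    return [(s - mean) / std for s in sums]          # Eq. (2)


def clipped_term(ratio, advantage, eps=0.2):
    """Inner term of Eq. (3) for one token: the minimum of the unclipped
    and clipped importance-weighted advantage."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# One prompt, G=4 rollouts, n=2 binary rewards (made-up values).
adv = grpo_advantages([[1, 1], [1, 0], [0, 0], [0, 1]])
print([round(a, 4) for a in adv])
print(clipped_term(ratio=1.5, advantage=adv[0]))
```

The full objective would average `clipped_term` over the tokens of each rollout and then over the `G` rollouts, as in Eq. (3); that averaging is left out here to keep the sketch short.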
3.4.2 Why summing first + group-normalizing can collapse the training signal
- The paper’s key observation is that different combinations of component rewards can map to the same normalized advantage after Eq. (2), especially when rewards are discrete/coarse or the group size is small (Section 2, Figure 2).
- Worked micro-example (the one used by the paper, Section 2 / Figure 2):
  - Setting: `G=2` rollouts per prompt, and `n=2` binary rewards `r_1, r_2 ∈ {0,1}`.
  - Each rollout’s total reward is in `{0,1,2}`.
  - Consider a group where rollout totals are `(0,1)`:
    - Mean is `0.5`, sample standard deviation is `0.7071`, so normalized advantages are approximately `(-0.7071, +0.7071)`.
  - Now consider totals `(0,2)`:
    - Mean is `1`, sample standard deviation is `1.4142`, so normalized advantages are also `(-0.7071, +0.7071)`.
  - In other words, a group where one rollout satisfies both rewards (total `2`) produces the same advantage pattern as a group where it satisfies only one reward (total `1`), losing a distinction that should matter for learning signal strength (Section 2 explanation).
- The paper generalizes this “collapse” intuition by counting the number of distinct advantage groups that can arise under different numbers of rollouts/rewards, showing `GDPO` preserves substantially more distinct advantage values than both `GRPO` and a `GRPO` variant without standard deviation normalization (Figure 3).
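The micro-example above can be checked numerically. The following sketch standardizes the two summed-reward groups with the sample standard deviation, which is the convention that reproduces the `0.7071` and `1.4142` values quoted above; the helper name `group_normalize` is illustrative.

```python
import statistics


def group_normalize(values):
    """Standardize values within one rollout group (sample std)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Two rollout groups with different summed rewards.
one_reward_met = group_normalize([0, 1])    # best rollout satisfies one reward
both_rewards_met = group_normalize([0, 2])  # best rollout satisfies both rewards

print([round(a, 4) for a in one_reward_met])    # [-0.7071, 0.7071]
print([round(a, 4) for a in both_rewards_met])  # [-0.7071, 0.7071]
```

Both groups yield the same advantage pair, so the update cannot tell "satisfied one reward" apart from "satisfied both".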
3.4.3 Why “GRPO without std normalization” is not enough
- The paper discusses a variant used in some recent works where the standard deviation term is removed:
$$
A_{\text{sum}}^{(i,j)} = r_{\text{sum}}^{(i,j)} - \operatorname{mean}\{r_{\text{sum}}^{(i,1)},\dots,r_{\text{sum}}^{(i,G)}\}
$$
  (described in Section 2; equation form not re-labeled there).
- This can separate some cases (e.g., `(0,1)` vs `(0,2)` become different), but the paper reports:
  - Only modest improvement in distinct advantage diversity when scaling rollouts/rewards (Figure 3).
  - Empirically, it does not improve convergence and can be unstable: in tool-calling, the “GRPO w/o std” variant achieves 0% correct format on BFCL-v3 despite reasonable correctness reward curves (Section 4.1.1, Table 2, Figure 4).
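For comparison with the standardized case, here is a minimal sketch of the mean-centering-only variant applied to the same two groups. The function name `center_only` is an illustrative choice.

```python
import statistics


def center_only(sums):
    """GRPO variant without std: subtract the group mean only."""
    mean = statistics.mean(sums)
    return [s - mean for s in sums]


print(center_only([0.0, 1.0]))  # [-0.5, 0.5]
print(center_only([0.0, 2.0]))  # [-1.0, 1.0]
```

The two groups are now distinguishable, but the advantage scale follows the raw reward magnitudes directly, which is one plausible reading of why the paper finds this variant less stable.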
3.4.4 GDPO: decouple normalization per reward, then stabilize scale with batch normalization
- `GDPO` changes only the advantage computation, leaving the overall policy-gradient update form similar in spirit to `GRPO`.
- Step 1: for each reward component `k ∈ {1..n}`, compute a per-reward group-normalized advantage within the rollout group for the same prompt:
$$
A_k^{(i,j)} = \frac{ r_k^{(i,j)} - \operatorname{mean}\{r_k^{(i,1)},\dots,r_k^{(i,G)}\} }{ \operatorname{std}\{r_k^{(i,1)},\dots,r_k^{(i,G)}\} } \tag{4}
$$
- Step 2: aggregate across objectives by summing the normalized components:
$$
A_{\text{sum}}^{(i,j)} = A_1^{(i,j)} + \cdots + A_n^{(i,j)} \tag{5}
$$
- Step 3: apply batch-wise normalization over all rollouts in the current training batch to keep the advantage scale stable as `n` grows:
$$
\hat{A}_{\text{sum}}^{(i,j)}= \frac{ A_{\text{sum}}^{(i,j)} - \operatorname{mean}\{A_{\text{sum}}^{(i',j')}\mid i'\in D_{\text{Batch}},\ j'=1..G\} }{ \operatorname{std}\{A_{\text{sum}}^{(i',j')}\mid i'\in D_{\text{Batch}},\ j'=1..G\}+\epsilon } \tag{6}
$$
  - Here `ε` is an additive stabilizer in the denominator, distinct from the PPO clip threshold `ε` in Eq. (3); the paper uses the same symbol for both, so the two must be told apart by context.
- Mechanistic intuition (how it fixes the collapse shown in Figure 2):
  - If you normalize only the sum, you can’t tell whether a high sum came from one reward being high vs. multiple rewards being jointly satisfied, once everything is standardized within a group.
  - If you normalize each reward separately, then differences along each reward axis remain visible in the aggregated advantage (Section 3.1, Figure 2 discussion).
- Stability note:
  - The paper states batch-wise normalization improves stability and shows that removing it can cause occasional convergence failures (Appendix A, Figure 8).
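The three steps above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the function names, the `batch[i][j][k]` layout, the `eps` guard added to the per-reward denominator in Eq. (4), and the choice of sample standard deviation are all illustrative and not confirmed by the paper.

```python
import statistics


def group_normalize(values, eps=1e-6):
    """Eq. (4) for one reward component within one rollout group.
    The eps guard for constant rewards is an assumption."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / (std + eps) for v in values]


def gdpo_advantages(batch, eps=1e-6):
    """batch[i][j][k]: reward k of rollout j for prompt i.
    Returns advantages with the same [i][j] layout."""
    summed = []
    for group in batch:
        n_rewards = len(group[0])
        per_reward = [
            group_normalize([rollout[k] for rollout in group], eps)
            for k in range(n_rewards)
        ]                                                    # Eq. (4)
        summed.append([
            sum(column[j] for column in per_reward)
            for j in range(len(group))
        ])                                                   # Eq. (5)
    flat = [a for group in summed for a in group]
    mean = statistics.mean(flat)
    std = statistics.stdev(flat)       # batch-level std (assumed sample std)
    return [[(a - mean) / (std + eps) for a in group]
            for group in summed]                             # Eq. (6)


batch = [
    [[0, 0], [1, 0]],  # summed rewards (0, 1)
    [[0, 0], [1, 1]],  # summed rewards (0, 2)
]
for group in gdpo_advantages(batch):
    print([round(a, 3) for a in group])
```

On this toy batch the second group receives advantages twice as large as the first, so the two cases that collapse under summed-reward `GRPO` stay distinct.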
3.4.5 Handling different objective priorities: weights vs conditioned rewards
The paper treats this as part of “effective incorporation of priority variation” (Section 3.2).
- Weighted objectives
  - Standard approach: choose weights `w_k` to reflect priority and form a weighted sum. In `GDPO`, weights apply to the normalized per-reward advantages:
$$
A_{\text{sum}}^{(i,j)} = w_1 A_1^{(i,j)} + \cdots + w_n A_n^{(i,j)} \tag{7}
$$
  - Practical issue reported: if one reward is much easier, the model may optimize it “regardless of assigned weights” unless weights become extremely imbalanced (Section 3.2; examined in Section 4.2.1).
- Conditioned rewards (gating an easy reward on a hard reward)
  - To prevent an easy objective from dominating, the paper describes conditioning reward `r_k` on another reward `r_l` meeting a threshold `t`:
$$
r_k= \begin{cases} r_k,& \text{if } r_l \ge t \\ 0,& \text{otherwise} \end{cases} \tag{8}
$$
  - Instantiation in math reasoning (Section 4.2.1): replace the length reward with a conditioned length reward that only pays out if the answer is correct:
    - `\tilde{R}_{length} = 1` iff length ≤ `l` and `R_correct = 1`, else `0`.
  - The paper claims this makes weight tuning behave more predictably afterward (Figure 6 discussion; Tables 8–9).
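To make Eqs. (7) and (8) concrete, the sketch below implements the weighted aggregation for a single rollout and the conditioned length reward from the math setup. All function and parameter names are hypothetical; the `max_tokens=4000` default mirrors the length target `l` reported for the math experiments.

```python
def weighted_advantage(per_reward_advantages, weights):
    """Eq. (7): weighted sum of already-normalized per-reward advantages
    for a single rollout."""
    if len(per_reward_advantages) != len(weights):
        raise ValueError("one weight per reward component is required")
    return sum(w * a for w, a in zip(weights, per_reward_advantages))


def conditioned_reward(r_k, r_l, threshold):
    """Eq. (8): reward k pays out only if reward l reaches the threshold."""
    return r_k if r_l >= threshold else 0.0


def conditioned_length_reward(num_tokens, is_correct, max_tokens=4000):
    """Math instantiation: 1 only when the response is both within the
    length target and correct."""
    r_length = 1.0 if num_tokens <= max_tokens else 0.0
    r_correct = 1.0 if is_correct else 0.0
    return conditioned_reward(r_length, r_correct, threshold=1.0)


print(conditioned_length_reward(3500, is_correct=True))   # 1.0
print(conditioned_length_reward(3500, is_correct=False))  # 0.0
print(conditioned_length_reward(4500, is_correct=True))   # 0.0
print(weighted_advantage([0.7, -0.7], weights=[1.0, 0.5]))
```

Note the difference in where each mechanism acts: weights rescale advantages after normalization, while conditioning changes the raw reward before any normalization happens.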
3.4.6 System/data pipeline “what happens first, second, third” (end-to-end training loop)
Putting the algorithm into an operational pipeline consistent with the paper’s setups:
- Sample a minibatch of prompts from the RL training dataset (e.g., tool-calling mixture totaling 4k prompts; math dataset 40k prompts; coding dataset 24k prompts).
- For each prompt, sample `G` rollouts from the current policy (e.g., `G=4` in tool calling; `G=16` in math/coding).
- Compute multiple reward components for each rollout using task-specific reward functions:
  - Tool calling: `R_format ∈ {0,1}` and `R_correct ∈ [-3,3]` (Section 4.1; Appendix C).
  - Math: `R_length ∈ {0,1}` and `R_correct ∈ {0,1}` (Section 4.2).
  - Coding: `R_pass ∈ [0,1]`, conditioned `\tilde{R}_{length} ∈ {0,1}`, and `R_bug ∈ {0,1}` (Section 4.3).
- Compute advantages:
  - Baseline: sum rewards then group-normalize (`GRPO`, Eq. (2)).
  - Proposed: group-normalize per reward (Eq. (4)), sum (Eq. (5)), batch-normalize (Eq. (6)).
- Update the policy using the clipped objective over tokens (Eq. (3)) under the training framework (`verl` is used throughout the experiments described).
- Evaluate periodically on downstream benchmarks (BFCL-v3 for tool calling; AIME/AMC/MATH/Minerva/OlympiadBench for math; PRIME validation tasks for coding) with specified decoding settings.
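The pipeline steps up to the advantage computation can be mocked end to end without a model. In the toy sketch below the policy is replaced by a random stand-in, the math rewards are computed from fake `(num_tokens, is_correct)` pairs, and the policy update is omitted entirely. Every name here is hypothetical, and the use of the population standard deviation with an `eps` guard is an assumption.

```python
import random
import statistics


def sample_rollouts(prompt, group_size, rng):
    """Stand-in for the policy: fake (num_tokens, is_correct) rollouts."""
    return [(rng.randint(1000, 8000), rng.random() < 0.5)
            for _ in range(group_size)]


def math_rewards(rollout, max_tokens=4000):
    """R_length and R_correct for the math setup, both in {0, 1}."""
    num_tokens, is_correct = rollout
    return [1.0 if num_tokens <= max_tokens else 0.0,
            1.0 if is_correct else 0.0]


def standardize(values, eps=1e-6):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def advantage_step(prompts, group_size=16, seed=0):
    rng = random.Random(seed)
    summed = []
    for prompt in prompts:
        rollouts = sample_rollouts(prompt, group_size, rng)   # sample G rollouts
        rewards = [math_rewards(r) for r in rollouts]         # reward components
        columns = [standardize([r[k] for r in rewards]) for k in range(2)]
        summed.append([a + b for a, b in zip(*columns)])      # Eqs. (4)-(5)
    flat = standardize([a for group in summed for a in group])  # Eq. (6)
    return [flat[i * group_size:(i + 1) * group_size]
            for i in range(len(prompts))]


advantages = advantage_step(["q1", "q2", "q3"], group_size=16)
print(len(advantages), len(advantages[0]))  # 3 16
```

In a real run the returned advantages would be broadcast to every token of the corresponding rollout and fed into the clipped objective of Eq. (3).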
3.4.7 Core configurations / hyperparameters reported in the paper (and what is missing)
The paper provides partial training hyperparameters (Tables 6–7), plus rollout and evaluation settings in Section 4. Key reported items:
- Tool calling training (Section 4.1; Table 6)
  - Models: `Qwen2.5-Instruct-1.5B` and `Qwen2.5-Instruct-3B`.
  - Steps: “100 steps” in the narrative (Section 4.1).
  - Rollouts per prompt: `4` (Table 6: `actor_rollout_ref.rollout.n = 4`; this config `n` is the group size `G`, not the number of rewards).
  - Batch size: `512` (Table 6).
  - Max response length: `1024` tokens (Section 4.1).
  - Max prompt length: `2048` (Table 6).
  - Learning rate: `1e-6` (Table 6).
  - KL coefficient: `0.001` (Table 6).
  - Total epochs: `15` (Table 6).
  - PPO minibatch size: `128` (Table 6).
  - Missing from provided content: optimizer type (e.g., AdamW), beta/epsilon values, LR schedule, model architecture details (layers/hidden size/heads), tokenizer/context window beyond max prompt/response limits, hardware/compute budget.
- Math and coding training (Section 4.2 / 4.3; Table 7)
  - Models: `DeepSeek-R1-1.5B`, `DeepSeek-R1-7B`, `Qwen3-4B-Instruct`.
  - Math steps: `500 steps` (Section 4.2).
  - Coding steps: `400 steps` (Section 4.3).
  - Rollouts per prompt: `16` (Table 7: `actor_rollout_ref.rollout.n = 16`).
  - Batch size: `512` (Table 7).
  - Max prompt length: `1024` (Table 7).
  - Learning rate: `1e-6` (Table 7).
  - Rollout temperature during training: `1` (Table 7).
  - PPO epochs: `1` (Table 7).
  - PPO minibatch size: `64` (Table 7).
  - Clip ratios: low `0.2`, high `0.28` (Table 7).
  - KL loss: coefficient `0.0005`, type `mse` (Table 7).
  - Group filtering enabled with metric `seq_reward` (Table 7: `algorithm.filter_groups.enable TRUE`).
  - Length target: `l = 4000` tokens (Section 4.2).
  - Missing from provided content: same as above (optimizer details, schedules, compute/hardware, and model architectural hyperparameters).
- Inference/evaluation decoding (Section 4.2 and 4.3)
  - Backend: `vLLM`.
  - Temperature: `0.6`.
  - `top_p = 0.95`.
  - Max response length: `32k` tokens.
  - Samples per question: `16` for evaluation.
4. Key Insights and Innovations
- (1) Identification of multi-reward “advantage collapse” in naive GRPO
  - Novelty: the paper isolates a concrete mechanism where group-normalizing an aggregated reward sum maps many distinct reward combinations to identical normalized advantages (Section 2, Figure 2).
  - Significance: this is not just lower performance; it can reduce training signal expressiveness and contribute to instability (the paper connects GRPO to declining correctness after ~400 steps in math training; Figure 5 narrative).
- (2) `GDPO`: group-wise decoupled normalization per reward
  - Novelty: compute per-reward group-normalized advantages (Eq. (4)) before combining objectives.
  - Significance: increases the number of distinct advantage values available to learning as rollouts/rewards scale (Figure 3), improving multi-reward optimization fidelity.
- (3) Batch-wise normalization of aggregated multi-reward advantage
  - Novelty (relative to just “normalize per reward”): the explicit additional normalization step (Eq. (6)) intended to keep advantage scale independent of the number of rewards.
  - Significance: empirically linked to training stability; removing it can cause occasional convergence failures (Appendix A, Figure 8).
- (4) Practical guidance on priority handling: weighting vs conditioning
  - The paper argues reward weights alone may not reflect intended priorities when objective difficulties differ (Section 3.2, Figure 6 discussion).
  - Conditioning an easy reward on a harder one (Eq. (8)) is presented as a more forceful mechanism to enforce priority ordering (Section 4.2.1, Figure 7, Table 4).
5. Experimental Analysis
Evaluation methodology (tasks, datasets, metrics, baselines)
- Baselines compared
  - `GRPO` with summed reward (standard baseline throughout).
  - `GRPO w/o std` (tool calling ablation; Section 4.1.1).
  - `GDPO` (proposed).
- Task 1: Tool calling (Section 4.1)
  - Training data: 4k prompts total: `2k` ToolACE, `1k` Hammer, `1k` xLAM (Section 4.1).
  - Rewards:
    - `R_format ∈ {0,1}` checks tag/order structure (Appendix C, Eq. (9)).
    - `R_correct ∈ [-3,3]` based on tool name/parameter matching against ground truth (Appendix C).
  - Training setup: 4 rollouts, batch size 512, max response length 1024, trained for 100 steps, 5 runs (Section 4.1; Table 6).
  - Evaluation benchmark: `BFCL-v3`, metrics include “Avg Acc” and “Correct Format” plus subcategories (Table 1).
- Task 2: Math reasoning (Section 4.2)
  - Training data: `DeepScaleR-Preview` dataset with `40k` competition-level problems (Section 4.2).
  - Rewards:
    - `R_length ∈ {0,1}` for staying within `l=4000` tokens.
    - `R_correct ∈ {0,1}` for final answer match.
  - Training setup: 16 rollouts, batch size 512, max response length 8000, 500 steps (Section 4.2; Table 7).
  - Evaluation benchmarks: `AIME-24`, `AMC 2022/2023`, `MATH`, `Minerva`, `OlympiadBench`, reporting pass@1 and “Exceed” (% outputs exceeding 4000 tokens) (Section 4.2; Table 3).
- Task 3: Coding reasoning (Section 4.3)
  - Training data: `Eurus-2-RL` with `24k` coding problems (Section 4.3).
  - Rewards:
    - `R_pass ∈ [0,1]` = fraction of test cases passed.
    - Conditioned `\tilde{R}_{length}` pays out only if length ≤ `l` and `R_pass = 1`.
    - `R_bug ∈ {0,1}` for no runtime/compilation error.
  - Training setup: 400 steps, “same hyperparameter configuration used in the mathematical reasoning experiments” (Section 4.3; Table 7 provides that configuration).
  - Evaluation: PRIME validation tasks (`Apps`, `CodeContests`, `Codeforces`, `Taco`), metrics: Pass, Exceed, Bug (Table 5).
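The math metrics described above (pass@1 and “Exceed”) reduce to simple averages over the sampled responses. The sketch below assumes pass@1 is estimated as the mean correctness across the samples drawn for a question, which this summary does not state explicitly; the function names and the sample data are hypothetical.

```python
def pass_at_1(correct_flags):
    """Fraction of sampled responses that are correct for one question."""
    return sum(correct_flags) / len(correct_flags)


def exceed_rate(token_counts, max_tokens=4000):
    """Fraction of responses longer than the length target."""
    over = sum(1 for n in token_counts if n > max_tokens)
    return over / len(token_counts)


# Hypothetical results for one question with 4 sampled responses.
correct = [True, False, True, True]
lengths = [3200, 4100, 3900, 5000]
print(pass_at_1(correct))    # 0.75
print(exceed_rate(lengths))  # 0.5
```

Benchmark-level numbers would then be averages of these per-question values over the whole evaluation set.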
Main quantitative results (with specific numbers)
- Tool calling: `GDPO` > `GRPO` on both accuracy and formatting (Table 1)
  - `Qwen2.5-Instruct-1.5B`:
    - Avg Acc: `GRPO 30.18%` → `GDPO 32.81%`
    - Correct Format: `GRPO 76.33%` → `GDPO 80.66%`
  - `Qwen2.5-Instruct-3B`:
    - Avg Acc: `GRPO 39.20%` → `GDPO 40.87%`
    - Correct Format: `GRPO 81.64%` → `GDPO 82.23%`
  - Training curves: `GDPO` converges to higher median reward values for both correctness and format (Figure 4; also Figure 1b for 1.5B tool-calling).
- Tool calling ablation: “GRPO w/o std” is unstable for formatting (Section 4.1.1, Table 2, Figure 4)
  - `Qwen2.5-Instruct-1.5B`:
    - Avg Acc: `GRPO 30.18%`, `GRPO w/o std 29.26%`, `GDPO 32.81%`
    - Correct Format: `GRPO 76.33%`, `GRPO w/o std 0%`, `GDPO 80.66%`
- Math reasoning: `GDPO` tends to improve accuracy while substantially reducing length violations (Table 3)
  - `DeepSeek-R1-1.5B`:
    - AIME Acc: `GRPO 23.1%` → `GDPO 29.4%` (an absolute increase of `+6.3%`, as also stated in the intro)
    - AIME Exceed: `GRPO 10.8%` → `GDPO 6.5%`
    - MATH Acc: `GRPO 83.6%` → `GDPO 86.2%`
    - MATH Exceed: `GRPO 1.5%` → `GDPO 0.8%`
  - `DeepSeek-R1-7B`:
    - AIME Acc: `GRPO 50.2%` → `GDPO 53.1%`
    - AIME Exceed: `GRPO 2.1%` → `GDPO 0.2%`
  - `Qwen3-4B-Instruct`:
    - AIME Acc: `GRPO 54.6%` → `GDPO 56.9%` (an absolute increase of `+2.3%`, as stated in the intro)
    - AIME Exceed: `GRPO 2.5%` → `GDPO 0.1%`
  - Training dynamics: both methods quickly maximize the easy length reward early, but `GDPO` recovers correctness and remains stable, while `GRPO` correctness declines after ~400 steps and max response length rises (Figure 5).
- Math priority experiments: weights alone behave inconsistently; conditioning changes behavior (Section 4.2.1)
  - Figure 6 shows accuracy/exceed trends under varying length reward weights with and without conditioning; the text highlights that decreasing `w_length` from `0.75` to `0.5` often barely changes exceed rates, implying weights don’t reliably impose priority when difficulties differ.
  - With conditioned length reward, the paper reports more predictable monotonic changes when adjusting `\tilde{w}_{length}` (Figure 6, plus Tables 8–9 in Appendix G).
- Coding reasoning: `GDPO` improves multi-objective trade-offs in both 2-reward and 3-reward settings (Table 5)
  - Two-objective (Pass + conditioned length), `DeepSeek-R1-7B`:
    - `CodeContests` Pass: `GRPO2-obj 63.2%` → `GDPO2-obj 65.8%`
    - `Taco` Pass: `GRPO2-obj 45.1%` → `GDPO2-obj 48.4%`
  - Three-objective (Pass + conditioned length + Bug):
    - `Codeforces` Bug: `GRPO3-obj 2.5%` → `GDPO3-obj 1.8%`
    - `Apps` Exceed: `GRPO3-obj 11.2%` → `GDPO3-obj 8.5%`
    - Pass is similar between `GRPO3-obj` and `GDPO3-obj` across tasks, while `GDPO3-obj` often reduces Exceed and Bug.
Do the experiments support the claims?
- The tool-calling and math experiments provide direct evidence that `GDPO` improves convergence and final metrics over `GRPO` under the paper’s settings (Tables 1 and 3; Figures 4 and 5).
- The “GRPO w/o std” ablation strengthens the argument that simply tweaking normalization is not sufficient and can introduce instability (Table 2, Figure 4).
- The coding experiments indicate the method continues to work when scaling from 2 to 3 rewards (Table 5), aligning with the paper’s generalization claim.
Ablations, failure cases, robustness checks
- Ablation: `GRPO w/o std` (Section 4.1.1) and `GDPO w/o BN` (Appendix A).
- Failure/stability: `GDPO` without batch-wise normalization sometimes fails to converge (Appendix A, Figure 8).
- The paper reports multiple-run statistics for tool-calling (5 runs, median + IQR bands in Figures 1b and 4), which partially addresses variance.
6. Limitations and Trade-offs
- Incomplete reporting of training stack details
  - The provided content includes key RL hyperparameters (batch size, rollouts, clip ratios, LR, KL coefficients) but does not specify optimizer type, LR schedule, or hardware/compute budget (Tables 6–7).
  - This makes exact reproduction harder without consulting the referenced configurations/appendices beyond what is shown here.
- Dependence on non-degenerate per-reward variance
  - Per-reward group normalization (Eq. (4)) divides by the reward’s standard deviation within a rollout group. If a reward is constant across the group for a prompt, the std is zero; the excerpt shown does not state how Eq. (4) is stabilized in that case (whereas Eq. (6) explicitly adds `+ ε`).
  - Practically, an implementation must guard against zero-variance rewards per group; the paper’s text does not spell out the exact handling for Eq. (4).
- Conditioning can change the feasible objective set
  - Conditioning an easy reward on a hard reward (Eq. (8)) enforces priority but can also withhold learning signal for the easy objective until the hard objective is met, which may slow learning of that objective in early phases (this is a conceptual trade-off; the paper emphasizes the priority benefit in Section 4.2.1).
- Scope of evaluation
  - The experiments cover three application types (tools, math, coding), but all are within the LLM RL fine-tuning regime described. The provided content includes no evaluation of settings like continuous-control RL, non-language policies, or reward models with very different statistical properties.
- Additional normalization layers
  - `GDPO` introduces more normalization steps (per reward + batch-wise). While empirically helpful here, extra normalization can sometimes interact with reward scaling/weighting in non-obvious ways; the paper partially addresses this with the batch-wise normalization motivation and ablation (Eq. (6), Appendix A).
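The zero-variance concern raised above for Eq. (4) is easy to reproduce and to guard against. The sketch below shows one possible guard, returning zero advantage for a reward that is constant within the group. This handling is an assumption for illustration, not the paper's stated behavior; adding `ε` to the denominator, as Eq. (6) does, would be another option.

```python
import statistics


def safe_group_normalize(values, eps=1e-6):
    """Eq. (4) with an assumed guard: a reward that is constant within
    the group gets zero advantage instead of a division by zero."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std < eps:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


print(safe_group_normalize([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
print(safe_group_normalize([0.0, 1.0, 1.0, 0.0]))  # [-1.0, 1.0, 1.0, -1.0]
```

With binary rewards such as `R_format` or `R_length`, constant groups are common once the policy has learned the easy objective, so this branch would be exercised often in practice.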
7. Implications and Future Directions
- How this changes practice for multi-reward RLHF-style training
  - A main implication is that “sum rewards then run `GRPO`” can be a poor default because it can erase distinctions between different multi-objective outcomes (Section 2, Figure 2).
  - `GDPO` offers a simple drop-in replacement at the advantage computation step that better preserves per-objective information and improves stability (Section 3.1, Figures 4–5, Tables 1–5).
- Follow-up research suggested by the paper’s findings
  - More principled handling of objective difficulty imbalance, beyond ad hoc weight changes, is motivated by the observation that easy rewards dominate early training (Figure 5 discussion; Section 3.2).
  - Further exploration of reward-structure design (e.g., conditioning strategies, thresholds `t` in Eq. (8)) could be systematized for different domains and reward distributions.
- Practical applications / downstream use cases
  - Any RL fine-tuning scenario where you must jointly optimize correctness + constraints, including:
    - Tool calling where outputs must be both correct and correctly structured (Section 4.1).
    - Reasoning models where you want accuracy but also bounded output length (Section 4.2).
    - Code generation where you want pass rate, brevity, and fewer runtime/compile failures (Section 4.3).
- Repro/Integration Guidance: when to prefer `GDPO` over alternatives (based on this paper)
  - Prefer `GDPO` over naive multi-reward `GRPO` when:
    - You have multiple heterogeneous rewards (different ranges/discreteness), and especially when rewards are coarse (binary/thresholded), because collapse is more likely (Section 2, Figure 2).
    - You observe training instability or late-stage degradation under `GRPO` in multi-reward setups (Figure 5 shows such a pattern in math reasoning).
    - You scale to more objectives (3-reward coding setting shows benefits; Table 5).
  - Use batch-wise advantage normalization (Eq. (6)) as part of `GDPO`:
    - The paper’s ablation indicates removing it can cause occasional failure (Appendix A, Figure 8).
  - For preference priority:
    - Start with weights (Eq. (7)) but do not expect reliable priority enforcement when objective difficulties differ (Section 4.2.1, Figure 6).
    - Consider conditioning (Eq. (8)) to enforce “must be correct before getting length/format reward,” as demonstrated for length control (Section 4.2.1, Figure 7, Table 4).