Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

ArXiv: 2512.01374

🎯 Pitch

This paper formulates the token-level policy-gradient objective as a first-order approximation to the true sequence-level objective (expected reward per response) and identifies two conditions, small training–inference discrepancy and small policy staleness, that determine when that approximation is valid. It derives why importance-sampling correction, clipping, and Routing Replay (for MoE models) help restore the approximation, verifies these recipes at scale on a 30B MoE model, and thereby offers practical, theoretically grounded methods to prevent RL collapse and enable stable, scalable RL for LLMs.


1. Executive Summary

This paper explains when it is mathematically justified to train LLMs with sequence-level rewards (one scalar per whole response) using the token-level policy-gradient objectives that are common in practice, by framing the token-level objective as a first-order approximation to the true sequence-level objective (Section 2.3, Eq. (1)–(4)). It argues that the approximation becomes reliable only when two gaps are controlled: training–inference discrepancy and policy staleness (Section 2.4, Eq. (5)), and it uses this lens to motivate stabilization tools such as importance sampling correction, clipping, and Routing Replay for MoE models (Section 3). Extensive experiments on a 30B MoE model show which combinations prevent training collapse under on-policy vs off-policy updates (Section 4; Figures 1–4).


2. Context and Motivation

  • Problem/gap addressed.
  • RL for LLMs often uses sequence-level rewards (a single score for the entire generated response), but many practical algorithms optimize a token-level objective (per-token log-probabilities weighted by an advantage or reward) (Introduction; Section 2).
  • This creates a conceptual mismatch: Why should token-level optimization improve a sequence-level objective, and when does it fail? The paper targets this mismatch as a root cause of instability (Introduction; Section 2.3–2.4).

  • Why this matters.

  • The paper defines “stable training” as steady improvement in reward/benchmarks and smooth evolution of internal diagnostics like entropy and training–inference KL, avoiding abrupt shifts or collapse (footnote in Introduction; Section 4.2 describes the diagnostics).
  • Stability is positioned as necessary for “successfully scaling RL,” meaning running longer or multi-stage RL without collapse (Introduction).

  • Prior approaches and shortcomings (as framed here).

  • Mainstream policy-gradient-style RL methods (named examples include REINFORCE and GRPO) “typically employ token-level optimization objectives,” even when the reward is sequence-level (Introduction).
  • Some work suggests “directly adopting sequence-level optimization objectives,” but this paper emphasizes why the sequence-level gradient is often intractable due to the huge variance/numerical range of sequence likelihood ratios (Section 2.2, Eq. (2)).
  • For Mixture-of-Experts (MoE) models, token-level importance sampling can break because expert routing changes which parameters are active per token, and routing can differ between inference and training engines (Introduction; Section 3.1).

  • Positioning relative to existing work.

  • The paper’s stance is not “token-level is always correct,” but: token-level objectives can be seen as a first-order approximation to the sequence-level objective, and stabilization techniques work insofar as they help the approximation’s conditions hold (Section 2.3–2.4; Section 3; Section 4.3–4.4).
  • It also explicitly cautions that Routing Replay can introduce bias by changing the effective target policy (Section 3.2; Table 1), so stabilization is traded against fidelity to the “natural” MoE policy.

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a reinforcement-learning training setup for LLMs where the model generates full responses, receives one reward per response, and updates the model using policy gradients.
  • It solves the problem of making token-level RL objectives sound and stable for sequence-rewarded LLMs, especially when rollouts are produced by a separate inference engine and when the model is an MoE with dynamic routing.

3.2 Big-picture architecture (diagram in words)

  • Prompt set D → sampled prompts x.
  • Inference engine rollout policy μ_{θ_old} generates responses y (and, for MoE, routed-expert decisions per token) → reward function computes R(x, y).
  • Training engine target policy π_θ consumes (x, y, R, metadata) and performs gradient updates using a token-level surrogate objective with:
  • importance sampling (IS) correction,
  • optional clipping/masking of updates,
  • optional Routing Replay to fix MoE routing during optimization.
  • Updated θ becomes the next step’s θ_old in a synchronous loop (Section 4.2).

3.3 Roadmap for the deep dive

  • Explain the true objective the training is trying to maximize (sequence-level expected reward) and why its direct gradient is impractical (Section 2.2).
  • Derive the token-level surrogate objective and show the first-order approximation argument (Section 2.3).
  • Decompose “closeness” between rollout and target policies into training–inference discrepancy and policy staleness (Section 2.4; Eq. (5)).
  • Show why MoE expert routing makes these discrepancies worse and can break token-level IS logic (Section 3.1; Eq. (6)).
  • Describe Routing Replay variants (R2/R3) and their bias/stability trade-off (Section 3.2; Table 1).
  • Connect the theory to the concrete algorithm MiniRL and the empirical recipes under on-policy vs off-policy updates (Section 4; Eq. (7); Figures 1–4).

3.4 Detailed, sentence-based technical breakdown

Framing. This is primarily an algorithmic and empirical systems paper: it introduces a formulation that interprets common token-level policy-gradient training as a first-order approximation to a desired sequence-level RL objective, then tests stabilization practices at large scale (Sections 2–4).

3.4.1 The objective the paper wants: expected sequence-level reward

  • The LLM is treated as an autoregressive policy π_θ that assigns probability to a response y = (y_1, …, y_|y|) given prompt x via a product of token probabilities (Section 2.1).
  • The “true” objective is the expected reward over prompts and full sampled responses:
  • J_seq(θ) = E_{x ~ D, y ~ π_θ(·|x)} [ R(x, y) ] (Section 2.2).
  • Because responses are sampled in an inference engine using a rollout policy μ_{θ_old}, the paper rewrites the expectation using importance sampling:
  • J_seq(θ) = E_{x ~ D, y ~ μ_{θ_old}} [ (π_θ(y|x) / μ_{θ_old}(y|x)) · R(x, y) ] (Eq. (1)).
  • The gradient becomes a REINFORCE-like expression with a sequence-level likelihood ratio:
  • ∇_θ J_seq(θ) = E [ (π_θ(y|x)/μ_{θ_old}(y|x)) · R(x,y) · ∑_t ∇_θ log π_θ(y_t|x,y_<t) ] (Eq. (2)).
  • The paper’s key practical claim here is that this is “usually intractable” because π_θ(y|x) and μ_{θ_old}(y|x) are products over many tokens, producing extremely large/small numbers and high variance (Section 2.2).
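The numerical-range problem described above can be illustrated with a minimal sketch. The response length of 4,096 tokens and the constant per-token ratios are hypothetical values chosen for illustration, not figures from the paper; the only point is that a product of many near-1 factors drifts far from 1.

```python
import math

def sequence_ratio(token_ratios):
    """Sequence-level ratio as the product of per-token ratios (computed in log space)."""
    return math.exp(sum(math.log(r) for r in token_ratios))

NUM_TOKENS = 4096  # hypothetical response length, chosen only for illustration

# Each per-token ratio deviates from 1 by at most 1%, yet the product does not stay near 1.
for token_ratio in (0.99, 1.00, 1.01):
    ratio = sequence_ratio([token_ratio] * NUM_TOKENS)
    print(f"per-token ratio {token_ratio:.2f} -> sequence ratio {ratio:.3e}")
```

Even a uniform 1% per-token deviation moves the sequence-level weight by many orders of magnitude in either direction, which is consistent with the paper's argument that the sequence-level estimator in Eq. (2) is hard to use directly.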

3.4.2 The surrogate used in practice: token-level objective with token-level IS

  • Instead of the sequence-level ratio, the paper studies a surrogate objective that sums token-level IS ratios:
  • J_token(θ) = E_{x, y ~ μ_{θ_old}} [ ∑_{t=1}^{|y|} (π_θ(y_t|x,y_<t) / μ_{θ_old}(y_t|x,y_<t)) · R(x,y) ] (Eq. (3)).
  • Its gradient is:
  • ∇_θ J_token(θ) = E [ ∑_t (π_θ(y_t|·)/μ_{θ_old}(y_t|·)) · R(x,y) · ∇_θ log π_θ(y_t|·) ] (Eq. (4)).
  • The paper identifies Eq. (4) as “the basic policy gradient algorithm (REINFORCE) equipped with a token-level IS weight” (Section 2.3).

3.4.3 Why the surrogate can make sense: first-order approximation argument

  • The paper’s central derivation is that the sequence-level ratio can be approximated to first order when each token ratio is close to 1 (Section 2.3):
  • π_θ(y|x) / μ_{θ_old}(y|x) = ∏_t (π_θ(y_t|·)/μ_{θ_old}(y_t|·)).
  • It sets π_θ(y_t|·)/μ_{θ_old}(y_t|·) = 1 + δ_t with “small” δ_t, and uses:
  • ∏_t (1 + δ_t) ≈ 1 + ∑_t δ_t when ignoring second-order terms like δ_i δ_j (Section 2.3).
  • Under this approximation, the paper shows:
  • ∇_θ J_seq(θ) ≈ ∇_θ J_token(θ) (end of Section 2.3).

Worked micro-example (to make the approximation concrete).

  • Suppose a 2-token response (|y|=2) with token ratios 1+δ_1 and 1+δ_2.
  • The exact sequence ratio is (1+δ_1)(1+δ_2) = 1 + δ_1 + δ_2 + δ_1δ_2.
  • The first-order approximation drops δ_1δ_2, leaving ≈ 1 + δ_1 + δ_2, which corresponds to “add up per-token deviations” rather than “multiply them,” matching the surrogate’s structure (Section 2.3).
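A quick numerical check of the same approximation, as a sketch: the δ values below are hypothetical and only contrast a small-mismatch case with a large-mismatch case.

```python
import math

def exact_ratio(deltas):
    """Exact sequence ratio: product of the per-token ratios (1 + delta_t)."""
    return math.prod(1.0 + d for d in deltas)

def first_order_ratio(deltas):
    """First-order approximation: 1 plus the sum of per-token deviations."""
    return 1.0 + sum(deltas)

# Hypothetical per-token deviations.
cases = {
    "small mismatch": [0.001, -0.002, 0.0015],
    "large mismatch": [0.3, -0.25, 0.4],
}
for name, deltas in cases.items():
    error = abs(exact_ratio(deltas) - first_order_ratio(deltas))
    print(f"{name}: approximation error = {error:.2e}")
```

The dropped terms are products of deviations, so the error shrinks quadratically as the per-token ratios approach 1 and grows quickly once they do not.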

3.4.4 When the approximation holds: two sources of “policy mismatch”

  • The approximation requires the rollout policy μ_{θ_old} and target policy π_θ to be “close” (Section 2.4).
  • The paper makes this “closeness” interpretable by decomposing each token-level ratio into two factors (Eq. (5)):
  • Training–inference discrepancy: π_{θ_old}(y_t|·) / μ_{θ_old}(y_t|·)
    • This is the mismatch between probabilities computed by the training engine vs the inference engine even at the same parameters θ_old (Section 2.4; Eq. (5)).
  • Policy staleness: π_θ(y_t|·) / π_{θ_old}(y_t|·)
    • This is the mismatch introduced because the policy being optimized π_θ may have moved away from the rollout policy snapshot π_{θ_old} (Section 2.4; Eq. (5)).
  • The paper explains common causes:
  • Training–inference discrepancy can come from different kernels and nondeterminism/throughput settings across engines, and is amplified in MoE due to routing (Section 2.4).
  • Policy staleness increases when you reuse the same rollouts for multiple gradient updates (splitting into mini-batches), or in asynchronous setups where a single response may be generated by multiple policy versions (Section 2.4).
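The two-factor decomposition in Eq. (5) can be sketched for a single token as follows. The function name and the three probabilities are hypothetical; the inputs are log-probabilities of the same sampled token under the training engine at θ, the training engine at θ_old, and the inference engine at θ_old.

```python
import math

def decompose_token_ratio(logp_train_new, logp_train_old, logp_infer_old):
    """Split the token-level IS ratio into (training-inference discrepancy, policy staleness)."""
    discrepancy = math.exp(logp_train_old - logp_infer_old)  # pi_old / mu_old
    staleness = math.exp(logp_train_new - logp_train_old)    # pi_new / pi_old
    return discrepancy, staleness

# Hypothetical probabilities of one sampled token.
logp_infer_old = math.log(0.50)  # inference engine, parameters theta_old
logp_train_old = math.log(0.52)  # training engine, parameters theta_old
logp_train_new = math.log(0.60)  # training engine, current parameters theta

discrepancy, staleness = decompose_token_ratio(logp_train_new, logp_train_old, logp_infer_old)
full_ratio = math.exp(logp_train_new - logp_infer_old)       # pi_new / mu_old
print(f"discrepancy={discrepancy:.4f} staleness={staleness:.4f} product={discrepancy * staleness:.4f}")
```

Logging the two factors separately indicates which gap is growing when the full ratio drifts away from 1.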

3.4.5 Why MoE makes it harder: routing entangles with both discrepancies

  • In an MoE model, each token activates only a subset of experts chosen by a router; thus the conditional probability depends on routed experts e_t (Section 3.1).
  • The paper rewrites the token IS ratio in MoE form (Eq. (6)), explicitly including routed experts for:
  • training engine routing vs inference engine routing (e^π vs e^μ),
  • old vs current routing (e_old vs e).
  • The key mechanism is:
  • If training vs inference engines choose different experts for the same token context, then even the “same model parameters” can represent different effective computations, amplifying training–inference discrepancy (Section 3.1).
  • As parameters change, routing decisions can shift, so policy staleness is not just “weights changed” but “which experts were used changed,” which can cause large effective policy jumps (Section 3.1).

3.4.6 Routing Replay: stabilize by fixing routing during optimization (with bias)

  • The paper introduces Routing Replay as a way to “optimize [an MoE] like a dense one” by fixing which experts are used during gradient updates (Section 3.2).
  • Two variants are formalized:

  • Vanilla Routing Replay (R2) replays the experts chosen by the rollout policy in the training engine, e^{π}_{old,t}, during gradient updates (Section 3.2).

    • Intended effect: reduce routing-induced policy staleness (Section 3.2, R2 equation block).
  • Rollout Routing Replay (R3) replays the experts chosen by the rollout policy in the inference engine, e^{μ}_{old,t}, inside the training engine (Section 3.2).

    • Intended effect: reduce training–inference discrepancy (align training computation with inference routing) and also reduce routing-induced staleness (Section 3.2, R3 equation block).
  • Bias trade-off.

  • The paper emphasizes that Routing Replay changes the effective target policy from the “naturally routed” π_θ to a constrained policy π^{R2}_θ or π^{R3}_θ (Section 3.2).
  • Table 1 highlights a specific difference under multiple mini-batch updates (off-policy reuse):
    • With R2, the first mini-batch can match the “natural” routing (e^{π}_{old,t} = e^{π}_t), so it does not alter the target policy in that first update, but later mini-batches do (Table 1).
    • With R3, even the first mini-batch uses e^{μ}_{old,t} ≠ e^{π}_t, so it alters the target policy from the start (Table 1).
  • The paper’s empirical hypothesis later is that this bias-vs-stability balance shifts with degree of off-policiness (Section 4.4).
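The difference between natural routing and replayed routing can be sketched with a toy top-k router. Everything here is illustrative: the helper names, the four-expert layout, k = 2, and the logits are assumptions rather than details from the paper. For R2 the recorded experts would come from the training engine at θ_old; for R3 they would come from the inference engine during rollout.

```python
def top_k_experts(router_logits, k):
    """Indices of the k experts with the highest router logits (natural routing)."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return sorted(order[:k])

def select_experts(router_logits, k, replayed=None):
    """Use natural top-k routing unless a recorded routing decision is replayed."""
    if replayed is None:
        return top_k_experts(router_logits, k)
    return list(replayed)

# Hypothetical router logits for one token, before and after a gradient update.
old_logits = [2.0, 1.9, 0.1, -1.0]
new_logits = [1.9, 2.1, 2.0, -1.0]

recorded = top_k_experts(old_logits, k=2)                   # routing stored at rollout time
natural = select_experts(new_logits, k=2)                   # routing of the unconstrained policy
replayed = select_experts(new_logits, k=2, replayed=recorded)
print(f"recorded={recorded} natural={natural} replayed={replayed}")
```

In this toy case a small change in logits flips one of the two active experts, while replay keeps the original pair. That is the stabilizing effect, and it is also the source of the bias discussed above, since the replayed computation is no longer the naturally routed policy.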

3.4.7 The concrete training recipe studied: MiniRL objective and training loop

  • MiniRL baseline (Section 4.1). MiniRL starts from the token-level surrogate but adds two “minimal modifications”:
  • Group-normalization of rewards into an advantage estimate:
    • A_b(x,y) = R(x,y) − E_{y' ~ μ_{θ_old}(·|x)}[R(x,y')] (Section 4.1).
    • In this paper’s experiments, rewards are binary (Section 4.2), so this turns a {0,1} reward into a centered value relative to other samples for the same prompt.
  • Clipping/masking of tokens inspired by PPO, implemented as a mask M_t that can stop gradients when per-token ratio r_t = π_θ(y_t|·)/π_{θ_old}(y_t|·) moves beyond thresholds, with different conditions for positive vs negative advantage (Eq. (7)).
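The advantage in the first modification above can be sketched as follows. This assumes the expectation over responses is estimated by the sample mean of the G rewards drawn for the same prompt; the function name is hypothetical.

```python
def group_advantages(rewards):
    """Subtract the per-prompt mean reward from each response's reward."""
    baseline = sum(rewards) / len(rewards)  # sample estimate of the expected reward (assumption)
    return [r - baseline for r in rewards]

# Binary rewards for four hypothetical responses to one prompt.
print(group_advantages([1, 0, 0, 1]))
# A prompt where every response gets the same reward yields all-zero advantages.
print(group_advantages([1, 1, 1, 1]))
```

Under this estimate, prompts whose responses are all correct or all incorrect contribute zero advantage and therefore no gradient signal.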

  • MiniRL objective (Eq. (7)).

  • The objective sums token log-probabilities weighted by:
    • a stopped-gradient token-level IS ratio π_θ(y_t|·)/μ_{θ_old}(y_t|·) (stopped gradient indicated by sg),
    • the advantage A_b(x,y),
    • and the clipping mask M_t.
  • The clipping thresholds are given in the setup: ε_high = 0.27, ε_low = 0.2 (Section 4.2).
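A sketch of how the three factors could combine into a per-token weight. The exact masking conditions of Eq. (7) are not reproduced in this summary, so the rule below is an assumption modeled on PPO-style one-sided clipping: a token is masked when its ratio has already moved past the threshold in the direction the advantage would push it. The function names are hypothetical; the thresholds are the ones reported in Section 4.2.

```python
EPS_HIGH = 0.27  # reported upper clipping threshold
EPS_LOW = 0.2    # reported lower clipping threshold

def clip_mask(staleness_ratio, advantage, eps_low=EPS_LOW, eps_high=EPS_HIGH):
    """Return 0.0 to stop this token's gradient, else 1.0 (assumed PPO-style rule)."""
    if advantage > 0 and staleness_ratio > 1.0 + eps_high:
        return 0.0
    if advantage < 0 and staleness_ratio < 1.0 - eps_low:
        return 0.0
    return 1.0

def token_weight(is_ratio, advantage, staleness_ratio):
    """Coefficient multiplying grad log pi for one token; is_ratio is treated as a constant (sg)."""
    return is_ratio * advantage * clip_mask(staleness_ratio, advantage)

# Hypothetical token: positive advantage, policy already moved well above the old policy.
print(token_weight(is_ratio=1.35, advantage=0.5, staleness_ratio=1.30))
# Same ratios with a negative advantage: the update pulls the ratio back, so it is kept.
print(token_weight(is_ratio=1.35, advantage=-0.5, staleness_ratio=1.30))
```

In a real training loop the IS ratio would be detached from the autograd graph so that only the log-probability term carries gradient.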

  • Synchronous RL pipeline (Section 4.2).

  • In each global step:
    1. Sample B prompts.
    2. Generate G responses per prompt using inference-engine rollout policy μ_{θ_old}.
    3. Split all responses into N mini-batches.
    4. Apply N gradient updates in the training engine.
    5. Use the updated policy as next step’s rollout policy.
  • Across runs, mini-batch size is fixed to 1,024 responses per gradient update, achieved via B = 64, G = 16 (so 64 × 16 = 1024) (Section 4.2).
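The relationship between global batch size, mini-batch size, and the number of updates per rollout batch can be sketched as below. The helper name and the integer stand-ins for responses are hypothetical; the mini-batch size of 1,024 and the values N = 1, 2, 4, 8 are the settings described in this summary.

```python
def split_into_minibatches(responses, mbs):
    """Split one global batch of responses into equal mini-batches of size mbs."""
    if len(responses) % mbs != 0:
        raise ValueError("global batch size must be a multiple of the mini-batch size")
    return [responses[i:i + mbs] for i in range(0, len(responses), mbs)]

MBS = 1024  # mini-batch size in responses, as reported for the main experiments

for n_updates in (1, 2, 4, 8):
    gbs = n_updates * MBS
    rollouts = list(range(gbs))  # stand-in for responses sampled from the rollout policy
    minibatches = split_into_minibatches(rollouts, MBS)
    # Only the first update sees rollouts from the current parameters; every later
    # update in the same global step reuses rollouts from an older snapshot.
    stale = len(minibatches) - 1
    print(f"gbs={gbs}: {len(minibatches)} gradient updates, {stale} on stale rollouts")
```

With N = 1 the loop is on-policy up to the training–inference discrepancy; larger N trades rollout cost for policy staleness.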

  • Truncated Importance Sampling (TIS).

  • The paper additionally truncates the token-level IS weight with threshold 5 (Section 4.2).
  • (The paper cites this as a stabilization trick; it does not provide a full derivation here beyond stating its use and threshold.)
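A minimal sketch of truncation applied to one token's IS weight, assuming the weight is simply capped from above at the threshold. The function name is hypothetical and the paper's exact formulation may differ.

```python
import math

def truncated_is_weight(logp_train, logp_infer, threshold=5.0):
    """Token-level IS weight pi/mu, capped from above at the truncation threshold."""
    return min(math.exp(logp_train - logp_infer), threshold)

# Hypothetical log-probabilities of a sampled token under the two engines.
print(truncated_is_weight(math.log(0.30), math.log(0.25)))   # ordinary case, weight near 1
print(truncated_is_weight(math.log(0.30), math.log(0.001)))  # outlier ratio is capped
```

Capping bounds the influence of tokens the inference engine sampled with very low probability, at the cost of a small bias in the corrected estimate.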

  • Precision / stress test setting.

  • Inference uses FP8, training uses BF16, explicitly intended as a “stress test” where inference precision is lower and training–inference discrepancy is large (Section 4.2).

  • Missing configuration details (explicitly not provided in the excerpt).

  • The paper names a “30B MoE model” and base “Qwen3-30B-A3B-Base” (Section 4.2), but does not specify architecture hyperparameters such as number of layers, hidden size, attention heads, tokenizer, optimizer type/settings, learning-rate schedule, weight decay, or exact hardware (GPU type/count). Those details are therefore unavailable to summarize faithfully from the provided content.

4. Key Insights and Innovations

  • (1) Token-level RL as a first-order approximation to sequence-level RL (core formulation).
  • Novelty: Instead of treating token-level policy gradients as an ad hoc mismatch, the paper provides an explicit approximation linking ∇J_seq and ∇J_token when token-level ratios stay close to 1 (Section 2.3).
  • Significance: This gives a principled criterion for stability—keep the rollout/target mismatch small—rather than relying purely on empirically motivated heuristics.

  • (2) Decomposition of “mismatch” into two actionable terms: training–inference discrepancy and policy staleness.

  • Novelty: Eq. (5) splits the token-level IS ratio into two factors tied to real system behavior: differences across engines and differences across time/policy versions (Section 2.4).
  • Significance: It explains why stability issues can arise even in nominally “on-policy” training if the inference and training engines disagree numerically, and why reuse of rollouts (off-policy updates) can destabilize training.

  • (3) MoE-specific instability mechanism: expert routing breaks the approximation conditions.

  • Novelty: Eq. (6) makes routing part of the probability ratio and shows it interacts with both discrepancy sources (Section 3.1).
  • Significance: This gives an explanation for why MoE RL can be especially brittle: the “policy” includes routing decisions, so small numerical differences can flip experts and cause large effective policy changes.

  • (4) Routing Replay as a stabilization method with an explicit bias lens.

  • Novelty: The paper does not just recommend Routing Replay; it formalizes two variants (R2, R3) and emphasizes they define different target policies (π^{R2}_θ, π^{R3}_θ) (Section 3.2; Table 1).
  • Significance: This clarifies a real trade-off: better approximation validity (stability) can come at the cost of optimizing a biased objective (Section 3.2), and the best choice depends on off-policiness (Section 4.4).

  • (5) Empirical claim that cold-start differences diminish under stabilized, prolonged RL.

  • Novelty: The paper reports that with a stable recipe, different cold-start initializations converge to comparable final performance (Section 4.5; Figure 5).
  • Significance: If robust, this shifts attention from “getting the perfect cold start” toward “ensuring RL stability and running long enough.”

5. Experimental Analysis

5.1 Evaluation methodology

  • Task.
  • Mathematical reasoning with binary reward R(x,y) ∈ {0,1} computed by matching the model response to a ground-truth answer (Section 4.2).

  • Training prompts.

  • A curated set of 4,096 math problems with verified answers (Section 4.2).

  • Benchmarks and metric.

  • Benchmarks: HMMT25, AIME25, AIME24, each with 30 problems (90 total) (Section 4.2).
  • Metric: average accuracy over 32 sampled responses per problem (Section 4.2).
  • The plots often show an aggregated “Benchmark Score” curve (Figures 1–4) and detailed per-benchmark curves are provided (Appendix B; Figures 6–9).
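The metric can be sketched as a mean of per-problem accuracies. The function name and the toy data are hypothetical; in the paper's setup each inner list would hold 32 binary outcomes.

```python
def average_accuracy(correct_by_problem):
    """Mean over problems of the fraction of sampled responses that are correct."""
    per_problem = [sum(samples) / len(samples) for samples in correct_by_problem]
    return sum(per_problem) / len(per_problem)

# Toy data: two problems with four sampled responses each (1 = correct, 0 = incorrect).
outcomes = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]
print(average_accuracy(outcomes))
```

Averaging over many samples per problem reduces the noise that a single sampled response would introduce on benchmarks with only 30 problems each.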

  • Diagnostics used to assess stability.

  • Token-level entropy of the target policy (approximation formula given) (Section 4.2).
  • Training–inference KL divergence D_KL( μ_{θ_old} || π_{θ_old} ) computed on tokens sampled from the inference rollout (Section 4.2).
  • The paper motivates KL as an indicator because prior work found collapses often coincide with sharp increases in training–inference mismatch (Section 4.2).
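One way to monitor the training–inference KL is sketched below. The sampled estimator, which averages log μ − log π over tokens drawn from the inference engine, is a standard Monte Carlo estimate and an assumption here; the paper's exact formula is not reproduced in this summary. The function names and toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Exact KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sampled_kl_estimate(logps_infer, logps_train):
    """Monte Carlo estimate of KL(mu || pi) from tokens sampled under mu."""
    diffs = [a - b for a, b in zip(logps_infer, logps_train)]
    return sum(diffs) / len(diffs)

# Toy next-token distributions from the inference engine (mu) and training engine (pi).
mu = [0.7, 0.2, 0.1]
pi = [0.6, 0.3, 0.1]
print(f"exact KL(mu || pi) = {kl_divergence(mu, pi):.4f}")

# Log-probabilities of three hypothetical sampled tokens under each engine.
print(sampled_kl_estimate([-0.36, -1.61, -0.36], [-0.51, -1.20, -0.51]))
```

The sampled estimate only needs the per-token log-probabilities that both engines already compute, so it adds little cost to the training loop.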

  • Training framework and batching.

  • Standard synchronous RL loop (Section 4.2).
  • Mini-batch size per gradient update fixed to 1,024 responses (B=64, G=16) (Section 4.2).
  • Off-policy is induced by using a larger global batch split into multiple mini-batches (Section 4.2, 4.4).

  • Key hyperparameters explicitly given.

  • Max generation length: 32,768 tokens in main experiments (Section 4.2).
  • Clipping thresholds: ε_high = 0.27, ε_low = 0.2 (Section 4.2; Eq. (7)).
  • TIS truncation threshold: 5 (Section 4.2).
  • Precision: FP8 inference, BF16 training (Section 4.2).
  • Compute: “hundreds of thousands of GPU hours,” and an estimate of 5–6 GPU hours per gradient step (Section 4.2).

5.2 Main quantitative results (reported with the specificity available)

Because the paper provides results primarily as curves (Figures 1–5; Figures 6–10) without tabulated peak values, the safest faithful summary is comparative/qualitative, anchored to the plotted findings and stated conclusions.

On-policy setting (global batch size = mini-batch size)

  • Setup.
  • gbs = mbs = 1,024 (Figure 1; Section 4.3).

  • Methods compared.

  • MiniRL (with training–inference IS correction).
  • MiniRL + length-norm (length normalization).
  • MiniRL w/o train-infer-IS (removes IS correction for training–inference discrepancy).
  • Each also tested with R3 (Section 4.3; Figure 1).

  • Findings (Figure 1; Section 4.3 bullets).

  • MiniRL achieves the best benchmark score trajectory and is described as the most stable.
  • Adding length normalization produces worse benchmark performance while remaining stable, which the paper interprets as biasing the objective away from the first-order approximation target (Section 4.3).
  • Removing training–inference IS correction causes rapid collapse and a sharp entropy drop (Figure 1; Section 4.3). This is the strongest on-policy evidence that training–inference discrepancy alone can break training even when θ = θ_old.
  • Applying R3 reduces training–inference KL but does not improve performance in this on-policy regime, and can worsen performance when combined with length normalization (Figure 1; Section 4.3). The paper interprets this as evidence of Routing Replay’s bias (Section 4.3 referencing the concern in Section 3.2).

Off-policy settings (multiple gradient updates per rollout batch)

  • Setup.
  • Mini-batch size fixed: mbs = 1,024.
  • Global batch sizes varied:

    • gbs = 2,048 (N=2) (Figure 2; Section 4.4),
    • gbs = 4,096 (N=4) (Figure 3; Section 4.4),
    • gbs = 8,192 (N=8) (Figure 4; Section 4.4).
  • Methods compared (Section 4.4).

  • MiniRL (no clipping) (explicitly removing clipping).
  • MiniRL + R2 (no clipping).
  • MiniRL + R2 (with clipping).
  • MiniRL + R3 (with clipping, as MiniRL includes clipping by default in Section 4.1).

  • Findings (Figures 2–4; Section 4.4 bullets).

  • Once off-policy updates are introduced, both Routing Replay and clipping are essential: removing either leads to premature collapse and lower peak benchmark scores (Figures 2–3; Section 4.4).
  • Choice of replay variant depends on degree of off-policiness:
    • For smaller off-policy (gbs = 2 × mbs), R2 outperforms R3 (Figure 2; Section 4.4).
    • For larger off-policy (gbs = 4 × mbs and gbs = 8 × mbs), R3 surpasses R2, and R2 may fail to maintain stability (Figures 3–4; Section 4.4).
  • The paper’s interpretation ties back to Table 1: R3 introduces bias immediately (even first mini-batch) but may better preserve approximation validity under high staleness, whereas R2 preserves the original target policy in the first mini-batch but becomes insufficient when staleness is large (Section 4.4 referencing Section 3.2 and Table 1).

Varying cold-start initializations

  • Setup (Section 4.5).
  • Different cold-start data distilled from three frontier models (names listed in Section 4.5).
  • Uses an “early-experimental small Qwen3Next MoE model” (Section 4.5).
  • Batch/generation settings: gbs = 4,096, mbs = 2,048 (B=128, G=16, N=2), generation length 65,536 tokens (Section 4.5).
  • Training recipe: MiniRL + R2 (Section 4.5).

  • Findings (Figure 5; Section 4.5).

  • The three cold-start initializations reach comparable final benchmark performance curves (Figure 5).
  • The paper also plots response length dynamics (Figure 5) and provides detailed per-benchmark/length plots (Figure 10), using this to argue that with stabilized RL, cold-start differences “vanish” with prolonged optimization (Section 4.5; also highlighted in Abstract).

5.3 Do the experiments support the paper’s claims?

  • Supportive points.
  • The on-policy ablation “remove training–inference IS correction” causing collapse (Figure 1; Section 4.3) is strong evidence for the paper’s thesis that training–inference discrepancy must be corrected for the approximation to be valid (Section 2.4; Eq. (5)).
  • Off-policy results align with the “policy staleness” story: increasing reuse (N=2 → 4 → 8) increases instability unless clipping + replay are used (Figures 2–4; Section 4.4), matching the condition in Section 2.4.
  • The MoE-specific emphasis is exercised in practice: Routing Replay is empirically necessary in off-policy settings (Section 4.4), consistent with Section 3’s argument that routing breaks token-level ratios.

  • Where evidence is suggestive but not definitive (based on what’s shown).

  • The claim that Routing Replay’s benefit comes from restoring first-order approximation validity is plausible and consistent with the plotted KL/entropy diagnostics, but the paper does not provide a direct quantitative measure of “approximation error” vs stability—only proxies like training–inference KL and qualitative collapse behavior (Section 4.2–4.4).
  • The bias discussion is empirically hinted by R3 reducing training–inference KL without improving on-policy performance (Figure 1; Section 4.3), but the paper does not isolate bias vs variance in a controlled decomposition beyond Table 1’s conceptual comparison.

  • Ablations/failure cases/robustness checks included.

  • On-policy: length normalization ablation and removing training–inference IS ablation, plus applying R3 combinations (Section 4.3; Figure 1; Appendix B Figure 6).
  • Off-policy: turning off clipping; comparing R2 vs R3 across three staleness levels (Section 4.4; Figures 2–4; Appendix B Figures 7–9).
  • Cold-start: three initializations with the same stabilized recipe (Section 4.5; Figure 5; Appendix B Figure 10).

6. Limitations and Trade-offs

  • First-order approximation is only justified in a “small mismatch” regime.
  • The entire conceptual bridge relies on token-level ratios being close to 1 so that higher-order terms can be neglected (Section 2.3–2.4). When mismatch grows (due to staleness, routing flips, engine discrepancy), the approximation can break, which manifests as instability/collapse in experiments (Figures 1–4).

  • Routing Replay stabilizes but biases the optimized policy.

  • R2/R3 explicitly alter the target policy by constraining routing to replayed experts (Section 3.2; Table 1). This is not just an implementation detail; it changes what objective is being optimized.
  • The on-policy result that R3 reduces KL but does not improve performance is consistent with “bias can hurt” (Section 4.3; Figure 1), but the paper does not quantify the bias directly.

  • Scope restricted to policy-gradient without token-level value models.

  • The paper explicitly excludes value-based settings like PPO with a value model that scores each token, stating it is “inherently difficult” to obtain reliable general value models (Section 2.1). This means conclusions are primarily about REINFORCE-like objectives and their variants, not actor-critic stability.

  • Empirical scope: one task type and reward type.

  • Training is on math reasoning with binary rewards (Section 4.2). It is unclear from this paper alone how the recipes transfer to:

    • dense or noisy rewards,
    • preference-model rewards,
    • multi-objective RL (helpfulness/harmlessness),
    • non-math domains.
  • Incomplete reporting for full reproducibility (from the provided content).

  • While batch sizes, lengths, clipping thresholds, precision, and compute/step are provided (Section 4.2), key training details are not specified here:
    • optimizer and its settings,
    • learning-rate schedule,
    • total number of training steps or tokens,
    • exact MoE architecture details,
    • hardware configuration.
  • This limits how precisely an external reader could replicate the exact training dynamics.

  • Compute cost and practicality.

  • The study is very expensive (“hundreds of thousands of GPU hours”; 5–6 GPU hours per gradient step) (Section 4.2), which makes exhaustive tuning/validation hard for most practitioners and raises the stakes of needing stable recipes.

7. Implications and Future Directions

  • Conceptual impact: stabilization practices become interpretable constraints.
  • The decomposition in Eq. (5) provides a practical mental model:
    • If training collapses even on-policy, suspect training–inference discrepancy and missing correction.
    • If training collapses when reusing rollouts, suspect policy staleness and overly aggressive updates (Section 2.4; Section 4.3–4.4).
  • This reframes clipping, IS correction, and routing replay as mechanisms to keep the system inside the “first-order-valid” regime (Section 2.4; Section 3; Section 4).

  • Practical applications / downstream use cases.

  • Any RL pipeline for LLMs that does any of the following can apply the paper’s recipes to reduce collapse risk (Sections 2–4):

    • generates in a separate inference stack (common in large-scale systems),
    • uses MoE architectures,
    • reuses rollouts for efficiency (off-policy updates).
  • Repro/Integration Guidance (recipe-level, based on this paper).

  • If you can afford on-policy updates (gbs = mbs):

    • Prefer the “basic policy gradient with training–inference IS correction” (the form MiniRL reduces to when updates are on-policy) because it is empirically the most stable in Figure 1 (Section 4.3).
    • Do not drop training–inference IS correction in split-engine settings; the paper shows rapid collapse without it (Figure 1; Section 4.3).
    • Be cautious with length normalization in the objective; here it is linked to suboptimal performance despite stability (Section 4.3).
    • Routing Replay (R3) is not shown to help on-policy performance and may introduce harmful bias (Figure 1; Section 4.3).
  • If you need faster convergence via off-policy reuse (gbs > mbs):

    • Use clipping (MiniRL’s mask M_t) and Routing Replay; omitting either destabilizes training in the paper’s experiments (Figures 2–3; Section 4.4).
    • Choose replay variant by staleness level:
      • Mild off-policy (N=2): R2 performs better than R3 (Figure 2; Section 4.4).
      • Higher off-policy (N=4 or N=8): R3 becomes better and may be necessary for stability (Figures 3–4; Section 4.4).
  • If you are using low-precision inference (e.g., FP8) with higher-precision training:

    • The paper intentionally uses FP8 inference and BF16 training as a stress test (Section 4.2), and the strong dependence on training–inference IS correction suggests you should expect mismatch-driven instability unless corrected (Section 4.3).
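The recipe guidance above can be condensed into a small lookup, shown here as a sketch. The dictionary layout, key names, and function name are hypothetical; the entries restate only the settings the paper reports testing (N = 1, 2, 4, 8) on its 30B MoE math setup and are not a general recommendation.

```python
# Summary of the reported findings, keyed by N = gbs / mbs (updates per rollout batch).
REPORTED_RECIPES = {
    # On-policy: the staleness ratio is 1, so the clipping mask never triggers.
    1: {"train_infer_is": True, "clipping": False, "routing_replay": None},
    2: {"train_infer_is": True, "clipping": True, "routing_replay": "R2"},
    4: {"train_infer_is": True, "clipping": True, "routing_replay": "R3"},
    8: {"train_infer_is": True, "clipping": True, "routing_replay": "R3"},
}

def reported_recipe(n_updates):
    """Look up the best-performing configuration reported for a tested N."""
    if n_updates not in REPORTED_RECIPES:
        raise ValueError(f"N={n_updates} is not among the settings covered in the paper")
    return REPORTED_RECIPES[n_updates]

for n in (1, 2, 4, 8):
    print(n, reported_recipe(n))
```

Untested values of N raise an error rather than interpolating, since the summary gives no evidence for settings between or beyond the reported ones.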
  • Research directions suggested by the paper’s gaps (grounded to what it flags).

  • Better clipping/masking strategies: the paper notes there are alternatives (e.g., sequence-level ratio clipping) but leaves them for future work (footnote near Section 4.1).
  • Better advantage estimators beyond the simple group-normalized reward baseline (Section 4.1).
  • More direct measurement of approximation validity (beyond proxies like KL/entropy), since the core theoretical claim hinges on small higher-order terms (Section 2.3–2.4).
  • Broader-domain validation beyond binary-reward math problems (limitation implied by Section 4.2’s narrow setup).
