Diversity or Precision? A Deep Dive into Next Token Prediction

🎯 Pitch

Reframes next-token cross-entropy as a single-step policy-gradient objective and introduces a reward-shaping pre-training objective that explicitly trades off precision (peaked probability on the ground truth) against diversity (higher local/global entropy). By sculpting the token-output distribution before RL, the paper shows that precision-oriented priors (e.g., rewarding positives more strongly or suppressing tail negatives) create a more effective exploration space for on-policy RL, yielding generally better end-to-end reasoning performance across dense and MoE models up to 10B.

1. Executive Summary

This work studies how the pre-trained next-token probability distribution controls the “exploration space” available to later reinforcement learning (RL) for reasoning in large language models (LLMs). It reframes standard next-token cross-entropy as a single-step policy-gradient objective and then introduces a reward-shaping pre-training loss that can explicitly trade off precision (peaked probability on the correct token) vs diversity (higher-entropy distributions). Across dense and MoE models up to 10B-A0.5B, the experiments show that a precision-oriented pre-training prior (e.g., β < 0 or penalizing tail negative tokens) yields better downstream RL reasoning performance than a global high-entropy prior.

2. Context and Motivation

Problem / gap.
RL training for reasoning (especially on-policy RL with verifiable rewards) is sensitive to what the model can plausibly generate early in training.
That early behavior is determined by the pre-trained model’s token distribution πθ(· | s), so pre-training implicitly sets the “reachable” exploration space for later RL.
Why it matters.
If the initial token distribution makes the model explore poorly (e.g., collapses too early, or wastes mass on implausible tail tokens), RL may converge slowly or to worse solutions—even if RL has strong rewards.
A controllable way to shape the token distribution during pre-training could improve end-to-end reasoning, not just perplexity.
Prior approaches and their shortcomings (as framed here).
Standard pre-training uses cross-entropy (teacher forcing) and does not explicitly assign meaningful structure to “negative” tokens beyond the softmax constraint.
Common modifications like label smoothing and focal loss change the effective weighting of examples/tokens, but they are not presented as a unified on-policy reward design for exploration shaping.
Positioning.
The core conceptual move is to treat next-token prediction as a stochastic decision process and interpret cross-entropy as a special case of policy gradient in a single-step episode.
Then, the work proposes a generalized reward function that can systematically manipulate entropy and negative-token treatment to study downstream RL effects.

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a pre-training objective for autoregressive LLMs that replaces plain cross-entropy with a policy-gradient-style, reward-shaped next-token training rule.
It solves the problem of controlling the pre-trained token distribution (entropy, head vs tail mass) so that subsequent on-policy RL has a better exploration starting point.

3.2 Big-picture architecture (diagram in words)

Input: A context/prefix s_t = X_{<t} and a ground-truth next token x_t from the dataset.
Model policy: The LLM defines a token distribution πθ(· | s_t) over vocabulary V.
Action sampling (on-policy): Sample a token a_t ~ πθ(· | s_t).
Reward shaping module: Compute a token-level reward \bar r(s_t, a_t) using:
a positive reward term (only if a_t = x_t) scaled by β, and
a rank-aware negative reward term (if a_t ≠ x_t) using Top-k sets and (\tilde λ, \hat λ).
Policy-gradient update: Apply E[ \bar r(s_t, a_t) ∇θ log πθ(a_t | s_t) ] (with stop-gradient through reward construction).
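The sample-then-reward-then-update loop above can be sketched for a single state with a toy softmax policy. This is a minimal illustration, not the paper's implementation: `sampled_logit_gradient`, `ce_reward`, and the three-token distribution are hypothetical names and values, and the gradient is taken with respect to the logits rather than the full model parameters.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sampled_logit_gradient(logits, target, reward_fn, num_samples, seed=0):
    """Monte Carlo estimate of E[ r(s, a) * d log pi(a|s) / d logits ]."""
    rng = random.Random(seed)
    probs = softmax(logits)
    vocab = list(range(len(probs)))
    grad = [0.0] * len(probs)
    for _ in range(num_samples):
        a = rng.choices(vocab, weights=probs)[0]  # on-policy action a ~ pi
        r = reward_fn(probs, a, target)           # constant w.r.t. theta (stop-gradient)
        for j in vocab:
            # d log softmax(a) / d z_j = 1(j == a) - pi_j
            grad[j] += r * ((1.0 if j == a else 0.0) - probs[j])
    return [g / num_samples for g in grad]

def ce_reward(probs, a, target):
    """Cross-entropy's intrinsic reward: 1 / pi on the ground truth, else 0."""
    return 1.0 / probs[a] if a == target else 0.0

# Toy state (assumed values): three tokens, ground truth is index 1.
logits = [math.log(0.2), math.log(0.5), math.log(0.3)]
estimate = sampled_logit_gradient(logits, target=1, reward_fn=ce_reward, num_samples=20000)
exact = [-0.2, 0.5, -0.3]  # one_hot(target) - probs, the cross-entropy logit gradient
print(estimate, exact)
```

With enough samples the estimate approaches the usual cross-entropy gradient on the logits, which is the equivalence the deep dive below derives.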

3.3 Roadmap for the deep dive

Explain next-token prediction as an RL problem (states/actions/rewards).
Derive cross-entropy as a single-step policy-gradient with an intrinsic reward.
Introduce the generalized reward with:
global entropy control via β (positive token shaping),
local entropy/head–tail control via Top-k and (\tilde λ, \hat λ) (negative token shaping).
Connect reward shaping to distributional effects (entropy, head/tail mass) and why that matters for downstream RL.
Detail the training pipeline (pre-training → mid-training → RLVR) and the exact hyperparameters used in experiments.

3.4 Detailed, sentence-based technical breakdown

Framing. This is primarily an algorithmic/objective-function paper with an empirical study: the core idea is that next-token pre-training can be written as single-step policy optimization, which enables explicit reward shaping to sculpt the initial policy for later RL.
Step 1: Next-token prediction as a decision process (Eq. (1)–(5)).
A token sequence is X = {x1, …, xn}.
At time t, the state is the prefix s_t = X_{<t} and the action is the next token a_t ∈ V, sampled from the model policy πθ(· | s_t).
A general RL objective maximizes expected cumulative reward:
- J(θ) = E_{τ~πθ} [ Σ_{t=1..n} r(s_t, a_t) ] (Eq. (1)),
- with policy gradient ∇θ J(θ) = E[ Σ_t (G_t - b(s_t)) ∇θ log πθ(a_t|s_t) ] (Eq. (3)).
To connect to next-token prediction, the work treats each token emission as a complete episode, i.e., for a fixed state s_t:
- J_t(θ | s_t) = E_{a_t~πθ(·|s_t)} [ r(s_t, a_t) ] (Eq. (4)),
- ∇θ J_t(θ | s_t) = E[ r(s_t, a_t) ∇θ log πθ(a_t | s_t) ] (Eq. (5)).
A key constraint highlighted is that for this to be consistent with the single-step formulation, the reward r(s_t, a_t) must depend only on the immediate (s_t, a_t) pair.
Step 2: Cross-entropy as a special case (Eq. (6)–(10)).
Standard supervised pre-training maximizes the log-likelihood of the ground-truth next token:
- J_CE(θ) = log πθ(x_t | s_t) (Eq. (6)),
- ∇θ J_CE(θ) = ∇θ log πθ(x_t | s_t) (Eq. (7)).
The work rewrites this gradient as an expectation over the model’s own distribution πθ(·|s_t) by inserting an indicator 1(a_t = x_t) and converting sums over the vocabulary into expectations (Eq. (8)–(9)).
This yields an “intrinsic reward” view of cross-entropy:
- r_CE(s_t, a_t) = sg( 1(a_t = x_t) / πθ(a_t | s_t) ) (Eq. (10)),
- where sg(·) is a stop-gradient operator (the reward is treated as a constant w.r.t. θ when taking gradients).
Mechanistically:
- If the sampled token matches the ground truth, reward is 1 / πθ(x_t|s_t), so low-probability correct tokens get larger reward, producing larger updates.
- If the token is incorrect, reward is exactly 0.
- Negative tokens are suppressed implicitly via the softmax constraint Σ_{a∈V} πθ(a|s_t) = 1: increasing the correct token’s probability forces others down.
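The rewrite behind Eq. (8)–(9) can be spelled out in three lines. The block below is a reconstruction from the definitions in this section, not the paper's exact typesetting; it relies on πθ(a_t | s_t) > 0 for every token, which holds for a softmax output.

```latex
\begin{aligned}
\nabla_\theta \log \pi_\theta(x_t \mid s_t)
  &= \sum_{a_t \in V} \mathbb{1}(a_t = x_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
  &= \sum_{a_t \in V} \pi_\theta(a_t \mid s_t)\,
     \frac{\mathbb{1}(a_t = x_t)}{\pi_\theta(a_t \mid s_t)}\,
     \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
  &= \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
     \left[ r_{CE}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
\qquad
r_{CE}(s_t, a_t) = \mathrm{sg}\!\left( \frac{\mathbb{1}(a_t = x_t)}{\pi_\theta(a_t \mid s_t)} \right).
\end{aligned}
```

The stop-gradient matters: without it, differentiating the 1/πθ factor would add an extra term and the expression would no longer match Eq. (7).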
Micro-example (to make Eq. (10) concrete).
Suppose the vocabulary is {A, B, C} and the ground-truth token is B.
If the model currently assigns πθ(B|s) = 0.05, then when a=B is sampled, r_CE = 1/0.05 = 20, making the gradient step large and pushing probability mass strongly toward B.
If instead πθ(B|s) = 0.8, then r_CE = 1/0.8 = 1.25, yielding a smaller update (the model is already confident).
If the sample is A or C, reward is 0, and the only “pressure” on those tokens comes indirectly from raising πθ(B|s).
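The micro-example can be checked numerically. The sketch below assumes a softmax policy and works on logits; the probabilities assigned to A and C are made-up values (the text only fixes πθ(B|s)), and the function names are illustrative.

```python
def ce_intrinsic_reward(prob_of_sampled, sampled_is_target):
    """Eq. (10): 1 / pi for the ground-truth token, 0 for any other token."""
    return 1.0 / prob_of_sampled if sampled_is_target else 0.0

def expected_logit_gradient(probs, rewards):
    """Exact E_{a~pi}[ r(a) * d log pi(a) / d z_j ] for a softmax policy.

    Since d log pi(a) / d z_j = 1(j == a) - pi_j, the expectation
    simplifies to pi_j * (r_j - E[r]) for logit j.
    """
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_reward) for p, r in zip(probs, rewards)]

vocab = ["A", "B", "C"]
target = "B"
# pi(B) = 0.05 and pi(B) = 0.8 come from the text; the A/C split is assumed.
for probs in ([0.60, 0.05, 0.35], [0.15, 0.80, 0.05]):
    rewards = [ce_intrinsic_reward(p, tok == target) for tok, p in zip(vocab, probs)]
    grad = expected_logit_gradient(probs, rewards)
    print("pi =", probs, "rewards =", rewards, "logit gradient =", grad)
```

In both cases the expected gradient equals one-hot(B) minus πθ, the familiar cross-entropy gradient on the logits: A and C are pushed down in proportion to their current probability even though their own reward is 0.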
Step 3: Generalized reward shaping to control diversity vs precision (Eq. (11)–(13)).
The work introduces a generalized single-step reward \bar r(s_t, a_t) that separately designs:
1) positive-token reward strength (global entropy control), and
2) negative-token treatment (local entropy/head–tail control).
(A) Positive reward shaping with β (Eq. (11)).
- For the correct token case (a_t = x_t), define:
- \bar r_pos(s_t, a_t) = sg( (1 / πθ(a_t|s_t)) * (1 - πθ(a_t|s_t))^β ) (Eq. (11)).
- Interpretation:
- The factor (1 - π)^β modulates reward depending on confidence.
- If β < 0, then (1-π)^β > 1 for every π in (0, 1) and grows as π approaches 1, so the reward is amplified and the update keeps pushing mass onto the ground truth even when the model is already confident. The result is lower global entropy and higher precision.
- If β > 0, then (1-π)^β < 1 and shrinks toward 0 as π approaches 1, so the reward is attenuated for confident predictions and the model is pushed less hard to concentrate on the ground truth. The result is higher global entropy and more diversity.
- The baseline cross-entropy behavior is recovered at β = 0 (since (1-π)^0 = 1).
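To see how β changes the positive reward in Eq. (11), the following sketch evaluates it at a few confidence levels for the three β values used in the experiments. The function name and the chosen probabilities are illustrative.

```python
def positive_reward(p_target, beta):
    """Eq. (11): (1 / pi) * (1 - pi) ** beta, used as a constant (stop-gradient) reward."""
    if not 0.0 < p_target < 1.0:
        raise ValueError("p_target must lie strictly between 0 and 1")
    return (1.0 / p_target) * (1.0 - p_target) ** beta

for beta in (-0.25, 0.0, 0.5):
    row = [round(positive_reward(p, beta), 3) for p in (0.05, 0.5, 0.95)]
    print(f"beta={beta:+.2f}  reward at pi=0.05 / 0.5 / 0.95: {row}")
```

Relative to β = 0, the β = -0.25 row is larger at every confidence level and the β = 0.5 row is smaller, with the gap widening as π approaches 1.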
(B) Rank-aware negative shaping with Top-k and (\tilde λ, \hat λ) (Eq. (12)).
- Define K_t = TopK(πθ(· | s_t), k), the set of the top-k most probable tokens under the current policy.
- For incorrect tokens (a_t ≠ x_t), assign:
- \bar r_neg(s_t, a_t) = \tilde λ · 1(a_t ∈ K_t ∧ a_t ≠ x_t) + \hat λ · 1(a_t ∉ K_t ∧ a_t ≠ x_t) (Eq. (12)).
- Interpretation:
- \tilde λ controls how the model treats high-ranking incorrect alternatives (the “head” competitors).
  - If \tilde λ is positive, it rewards sampling plausible-but-wrong head tokens, which can preserve multiple plausible continuations.
- \hat λ controls how the model treats low-probability tail tokens.
  - If \hat λ is negative, it penalizes the tail, concentrating mass into the head and away from implausible tokens.
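A small sketch of the rank-aware negative reward in Eq. (12) follows. It uses a toy five-token vocabulary with k = 3 so that both head and tail are non-empty (the experiments use k = 100); `lam_head` and `lam_tail` stand for \tilde λ and \hat λ, and the tie-breaking rule in `top_k_indices` is an assumption, since the text does not specify one.

```python
def top_k_indices(probs, k):
    """Indices of the k most probable tokens (ties broken by lower index; assumed)."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return set(order[:k])

def negative_reward(probs, action, target, k, lam_head, lam_tail):
    """Eq. (12): lam_head for wrong tokens inside Top-k, lam_tail for wrong tokens outside."""
    if action == target:
        return 0.0  # Eq. (12) only covers incorrect tokens
    return lam_head if action in top_k_indices(probs, k) else lam_tail

probs = [0.50, 0.30, 0.15, 0.04, 0.01]  # toy distribution, ground truth is index 0
target = 0
tail_penalty = [negative_reward(probs, a, target, k=3, lam_head=0.0, lam_tail=-0.1)
                for a in range(len(probs))]
head_reward = [negative_reward(probs, a, target, k=3, lam_head=0.1, lam_tail=0.0)
               for a in range(len(probs))]
print(tail_penalty)
print(head_reward)
```

The two lists mirror the two negative-shaping settings studied later: one penalizes only the tokens outside the Top-k set, the other rewards only the wrong tokens inside it.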
(C) Combined reward (Eq. (13)).
- The final reward is:
- \bar r(s_t, a_t) = \bar r_pos(s_t, a_t) · 1(a_t = x_t) + \bar r_neg(s_t, a_t) · 1(a_t ≠ x_t) (Eq. (13)).
- Setting β = 0, \tilde λ = 0, \hat λ = 0 recovers standard cross-entropy as a special case.
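The combined reward and its effect on the update can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: it uses a softmax policy, computes the exact expected gradient on the logits instead of sampling, and uses a toy vocabulary with k = 3. All function and parameter names are hypothetical.

```python
def shaped_reward(probs, action, target, beta=0.0, lam_head=0.0, lam_tail=0.0, k=100):
    """Eq. (13): positive branch for the ground truth, rank-aware branch otherwise."""
    if action == target:
        p = probs[target]
        return (1.0 / p) * (1.0 - p) ** beta          # Eq. (11)
    ranked = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    in_top_k = action in set(ranked[:k])
    return lam_head if in_top_k else lam_tail        # Eq. (12)

def expected_logit_gradient(probs, target, **knobs):
    """Exact E_{a~pi}[ r(a) * d log pi(a) / d z_j ] = pi_j * (r_j - E[r])."""
    rewards = [shaped_reward(probs, a, target, **knobs) for a in range(len(probs))]
    mean_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - mean_reward) for p, r in zip(probs, rewards)]

probs = [0.50, 0.30, 0.15, 0.04, 0.01]  # toy distribution, ground truth is index 0
ce_grad = expected_logit_gradient(probs, target=0)                         # all knobs at 0
tail_grad = expected_logit_gradient(probs, target=0, lam_tail=-0.1, k=3)   # tail penalty
print(ce_grad)
print(tail_grad)
```

With all knobs at zero the gradient is one-hot(target) minus πθ, i.e. plain cross-entropy. With the tail penalty, the two tail logits receive a more negative gradient than under cross-entropy, while the wrong head tokens are pushed down slightly less.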
Step 4: Training pipeline and configurations (Section 3.1 + Appendices A/B).
The end-to-end pipeline has three stages:
1) Pre-training: 500B tokens (general knowledge focused).
2) Mid-training: 100B tokens, including ~5% synthetic data and more reasoning-oriented content; synthetic long-reasoning data is deliberately excluded to observe “activation trends” of long-CoT capability.
3) RLVR: on mathematical reasoning tasks.
Architectures (Table 1). Models include:
- Dense: 1B (28 layers, d_model=1536, d_ffn=4608, n_head=16, n_kvhead=4), and 4B (36 layers, d_model=2560, d_ffn=9728, n_head=32, n_kvhead=8).
- MoE: 5B-A0.3B (12 layers, d_model=1024, d_expert=320, n_head=32, n_kvhead=4, total experts E=384, active experts E_a=12) and 10B-A0.5B (16 layers, d_model=1536, d_expert=320, n_head=32, n_kvhead=4, E=384, E_a=12).
- MoE training uses an “auxiliary loss free approach” (Appendix A.2 references Liu et al., 2024), but the internal details are not expanded in the provided excerpt.
Pre-training + mid-training optimizer/schedule (Appendix A.1).
- Optimizer: AdamW, weight decay 0.1.
- Gradient clipping: 1.0.
- Learning-rate schedule: “warmup-stable-decay.”
- Pre-training: warmup 2000 steps, then stable LR 3 × 10^-4 over 500B tokens.
- Mid-training: LR decays from 3 × 10^-4 to 3 × 10^-5 over 100B tokens.
- Global batch size: 16M (as written).
- Sequence length: max 4096 in pre-training; extended to 16384 in mid-training.
- Long-context adjustment: RoPE base frequency increased from 1e4 to 1e6 in mid-training.
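The warmup-stable-decay schedule can be sketched as a function of the optimizer step. Only the warmup length (2000 steps), the stable rate (3 × 10^-4), and the final rate (3 × 10^-5) come from the text; the linear warmup from zero, the linear decay shape, and the step counts in the usage line are assumptions made for illustration.

```python
def wsd_learning_rate(step, stable_steps, decay_steps,
                      warmup_steps=2000, peak_lr=3e-4, final_lr=3e-5):
    """Warmup-stable-decay learning rate.

    Assumptions (not stated in the source): linear warmup from 0, linear decay,
    and `stable_steps` counting all pre-training steps including warmup.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_steps:
        return peak_lr                      # pre-training plateau
    progress = min(1.0, (step - stable_steps) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress   # mid-training decay

# Hypothetical step counts, chosen only to exercise the three phases.
for step in (0, 1999, 10000, 33000, 36000):
    print(step, wsd_learning_rate(step, stable_steps=30000, decay_steps=6000))
```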
Reward-shaping hyperparameters explored (Section 3.1).
- Positive shaping: β = -0.25 (precision/low entropy) and β = 0.5 (higher entropy) compared to baseline β=0.
- Negative shaping:
- Tail penalty: \hat λ = -0.1, \tilde λ = 0, k=100
- Head reward: \hat λ = 0, \tilde λ = 0.1, k=100
RLVR setup (Appendix B.1).
- Algorithm: on-policy GRPO (named; internal derivation not given here).
- No KL regularization.
- Stabilization: “clip-higher” and “dynamic sampling” strategies (named, not fully specified).
- Two-stage RL sequence length: first 700 steps at 8K, then continue at 16K.
- RL batch size: 128.
- RL learning rate: constant 1 × 10^-6.
- During RL training: sample 16 outputs per prompt, temperature 1.0.

4. Key Insights and Innovations¶


(1) Cross-entropy as single-step policy gradient with an explicit intrinsic reward (Eq. (10)).
Novelty here is not “policy gradient exists,” but the specific mapping:
- cross-entropy gradient equals E_{a~πθ}[ r_CE(s,a) ∇ log πθ(a|s) ] with r_CE(s,a)=sg(1(a=x)/πθ(a|s)).
Significance: this makes next-token pre-training directly compatible with RL-style reward design, enabling controlled experiments on exploration shaping.
(2) A unified reward-shaped pre-training objective that subsumes cross-entropy (Eq. (11)–(13)).
The method provides two “knobs”:
- β for global entropy / precision via positive reward scaling,
- (\tilde λ, \hat λ, k) for local entropy / head–tail redistribution via rank-aware negative rewards.
Significance: it separates effects that cross-entropy conflates (rewarding correct token vs structuring negatives).
(3) Rank-aware asymmetry between high-probability and tail negatives (Eq. (12)).
Many losses treat all negatives similarly (implicitly or explicitly).
Here, negatives are split into:
- plausible competitors (Top-k), and
- tail tokens (outside Top-k), with different reward signs/magnitudes.
Significance: this targets where diversity is preserved (head) vs suppressed (tail), which the experiments link to RL stability and performance.
(4) Empirical claim: precision-oriented priors can improve RL exploration more than high entropy.
The central empirical takeaway contradicts the simple heuristic “higher entropy ⇒ better exploration.”
Significance: it suggests exploration quality depends on structured probability mass (credible head vs noisy tail), not just entropy magnitude.

5. Experimental Analysis

Evaluation methodology (Section 3.2)

Base-model evaluation (pre-trained and mid-trained checkpoints).
Capabilities grouped into:
- Knowledge-based: general knowledge + commonsense reasoning.
- Reasoning-based: logical reasoning + mathematics + coding.
Benchmarks (19 total) include:
- General knowledge: MMLU (4-shot, CoT), MMLU-Pro (5-shot, CoT), TriviaQA (5-shot), NaturalQuestions (5-shot).
- Commonsense: Hellaswag, SIQA, PIQA, WinoGrande, OpenBookQA, CommonsenseQA.
- Logic: ARC-Easy, ARC-Challenge, BBH (3-shot, CoT).
- Math: GSM8K, MATH-500, Minerva, OlympiadBench.
- Code: HumanEval+, MBPP+.
For tasks needing multiple samples (math/code), they use Pass@k with the unbiased estimator (Eq. (14)), sampling:
- m = 128 responses, temperature 0.7, top-p 0.95, reporting Pass@64.
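For reference, the widely used unbiased Pass@k estimator can be written in a few lines. The sketch assumes Eq. (14) is that standard combinatorial form, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones; the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Settings from the text: 128 samples per problem, reporting Pass@64.
for correct in (0, 1, 4, 128):
    print(correct, pass_at_k(128, correct, 64))
```

Per-problem values are then averaged over the benchmark. Note that a single correct sample out of 128 already gives Pass@64 = 0.5, which is why this metric reflects best-of-many capability rather than typical sample quality.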
Max output length:
- 4K for pre-trained models,
- 16K for mid-trained models.
RL model evaluation.
Datasets: AMC23, AIME (labeled as AIME24 and AIME25 in RL tables), MATH-500, Minerva, OlympiadBench.
Sampling: 128 responses per problem.
Metrics:
- Avg@128: average accuracy over 128 samples.
- Cons@128: majority-vote accuracy over 128 samples.
- Pass@64: probability that at least one of 64 samples is correct, estimated from the same pool of 128 samples.
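The difference between the average and majority-vote metrics can be illustrated on one problem. The sketch assumes final answers are compared by exact string match and that ties in the vote go to the answer seen first; neither detail is specified in the text, and the answer pool is made up.

```python
from collections import Counter

def avg_at_n(answers, reference):
    """Fraction of sampled answers that match the reference."""
    return sum(a == reference for a in answers) / len(answers)

def cons_at_n(answers, reference):
    """1.0 if the most frequent sampled answer matches the reference, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return float(majority == reference)

# 128 hypothetical final answers for one problem.
answers = ["42"] * 50 + ["41"] * 40 + ["7"] * 38
print(avg_at_n(answers, "42"), cons_at_n(answers, "42"))
print(avg_at_n(answers, "41"), cons_at_n(answers, "41"))
```

When the reference is "42", fewer than half of the samples are correct yet the majority vote succeeds; when it is "41", a sizeable share of samples is correct but the vote fails. Benchmark-level scores average these per-problem values.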

Main quantitative results

(A) Pre-training: perplexity converges similarly, entropy shifts (Figures 1–2)

Figures 1–2 show:
PPL curves converge to similar low values across configurations for both dense and MoE models.
Entropy is strongly affected by β:
- β < 0 reduces entropy (more peaked distribution),
- β > 0 maintains higher entropy (flatter distribution).
This supports the claim that reward shaping can change distributional properties without obviously harming next-token predictive convergence (as measured by PPL).
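The distributional quantities discussed here are easy to compute for a single next-token distribution. The sketch below is a generic diagnostic, not the paper's measurement code: how the entropy in Figures 1–2 is aggregated over tokens is not stated in the text, and the two toy distributions are made up.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def tail_mass(probs, k):
    """Probability mass outside the k most probable tokens."""
    return 1.0 - sum(sorted(probs, reverse=True)[:k])

peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # precision-oriented shape
flat = [0.40, 0.25, 0.15, 0.10, 0.10]    # diversity-oriented shape
for name, dist in (("peaked", peaked), ("flat", flat)):
    print(name, round(entropy(dist), 3), round(tail_mass(dist, 3), 3))
```

Tracking both numbers separates two effects that a single entropy value conflates: how concentrated the head is and how much mass leaks into the tail.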

(B) Pre-training: performance at 500B tokens favors precision as scale grows (Tables 2–5)

The pattern is most visible in larger models (4B, 5B MoE, 10B MoE):
4B dense, overall average at 500B tokens (Table 3):
- β=-0.25: 43.11
- β=0: 42.62
- β=0.50: 42.44
10B-A0.5B MoE, overall average at 500B tokens (Table 5):
- β=-0.25: 44.89
- β=0: 44.11
- β=0.50: 44.52
Interpretation consistent with the paper’s narrative: precision-oriented shaping (β < 0) tends to help scaling/performance at larger capacity, even if not uniformly best at small scale (e.g., 1B has mixed deltas in Table 2).

(C) Mid-training: β=-0.25 is consistently strong; negative shaping effects are smaller (Tables 10–13)

4B dense at 100B mid-training tokens (Table 10):
Average: β=-0.25 → 52.76, β=0 → 52.75, β=0.50 → 52.23.
The main separation is between β=0.50 and the other two.
10B-A0.5B MoE at 100B mid-training tokens (Table 11):
Average: β=-0.25 → 51.13, β=0 → 50.85, β=0.50 → 50.60.
Negative shaping in mid-training:
For 4B (Table 12), \hat λ=-0.1 yields the best overall average at 100B (53.00).
For 10B (Table 13), \tilde λ=0.1 is slightly higher (51.06) than \hat λ=-0.1 (50.92) and baseline (50.85) at 100B.
This suggests mid-training results do not cleanly pick a single negative-shaping winner, but they do not contradict the main downstream RL findings.

(D) RLVR: precision-oriented pre-training improves downstream reasoning (Tables 14–23)

The RL stage is where the clearest separations appear, especially for the 10B-A0.5B MoE model.

Global entropy control via β (baseline vs β=-0.25 vs β=0.50).
10B-A0.5B MoE, RL Average Pass@64 at step 1000:
- Baseline β=0 (Table 15): 47.52
- Precision-oriented β=-0.25 (Table 18): 50.75
- Higher-entropy β=0.50 (Table 19): 49.59
This directly supports the main claim: β=-0.25 provides a better RL initialization than both baseline and high-entropy in this setting.
For 4B dense, differences are smaller, but β=-0.25 beats β=0.50 at step 1000 on average Pass@64:
- β=-0.25 (Table 16): 50.43
- β=0.50 (Table 17): 50.09
- Baseline (Table 14): 50.99. Here the baseline is slightly higher than β=-0.25 on the final average, so an “always better than CE” claim does not strictly hold for the 4B final average; the paper instead emphasizes trajectory and robustness across metrics.
Local head–tail shaping via (\hat λ, \tilde λ).
Tail penalty tends to be strongest in RL, especially for 10B.
10B-A0.5B MoE, RL Average Pass@64 at step 1000:
- Baseline (Table 15): 47.52
- Tail penalty \hat λ=-0.1 (Table 22): 49.87
- Head reward \tilde λ=0.1 (Table 23): 49.42
4B dense, RL Average Pass@64 at step 1000:
- Baseline (Table 14): 50.99
- Tail penalty \hat λ=-0.1 (Table 20): 51.18
- Head reward \tilde λ=0.1 (Table 21): 50.26
Concrete dataset-level improvements (illustrative examples).
For 10B, MATH-500 Pass@64 at RL step 1000 increases from 85.75 (baseline, Table 15) to 86.89 (β=-0.25, Table 18), while Minerva Pass@64 drops slightly from 44.90 (baseline) to 44.46 (β=-0.25). Gains are therefore not uniform per dataset, but the overall average still improves by 3.23 points (47.52 to 50.75) because of broader gains, notably on AMC23 and OlympiadBench in the tables.
For 4B, the average improvements from negative shaping are modest but positive for tail penalty (51.18 vs 50.99).

Do the experiments support the claims?

Supported well:
Reward shaping changes entropy without derailing perplexity convergence (Figures 1–2).
Precision-oriented configurations improve RL outcomes for the 10B MoE model with clear margins (Tables 15 vs 18; 15 vs 22).
Tail-token suppression (\hat λ < 0) improves RL average Pass@64 for both 4B and 10B (Tables 20, 22): by a clear margin for 10B (+2.35 over baseline) and a modest one for 4B (+0.19).
More conditional / nuanced:
For 4B dense, the baseline CE configuration is competitive and sometimes slightly better in the final averaged Pass@64 than β=-0.25 (Tables 14 vs 16), though β=-0.25 still beats the high-entropy β=0.50 and the paper emphasizes trajectory/robustness across metrics.
Negative shaping in mid-training is not uniformly dominated by one setting (Table 13 slightly favors \tilde λ=0.1 at 100B on the 10B model).
Ablations / robustness checks present in the provided content:
Systematic sweeps over β and over negative-shaping settings (\hat λ=-0.1 vs \tilde λ=0.1) across multiple model sizes and architectures.
RL metrics include Avg@128, Cons@128, and Pass@64, which helps distinguish “average sample quality” from “majority vote” stability and from “best-of-many” capability.

6. Limitations and Trade-offs

Single-step episode assumption is specific.
The derivation treats each token emission as a full episode, requiring the reward to be a function only of (s_t, a_t).
This is appropriate for next-token prediction, but it does not directly model multi-step credit assignment within a reasoning trace at the pre-training stage.
Top-k negative shaping introduces discrete, rank-based behavior.
The negative reward depends on membership in TopK(πθ(·|s_t), k), which changes discontinuously as ranks change.
The paper uses k=100 in experiments; sensitivity to k beyond this value is not shown in the provided content.
Hyperparameter coverage is limited.
Only a few settings are explored (β ∈ { -0.25, 0, 0.5 }, \hat λ ∈ { -0.1, 0 }, \tilde λ ∈ { 0, 0.1 }).
It remains unclear from the excerpt how robust the conclusions are to intermediate values (e.g., β=-0.1 vs β=-0.5) or combined nonzero \hat λ and \tilde λ simultaneously.
RL scope is math-focused and uses a specific RL setup.
RLVR is run “prioritizing mathematical reasoning tasks,” and uses on-policy GRPO without KL regularization (Appendix B.1).
Conclusions about exploration may change under other RL algorithms, reward models, or KL-constrained objectives; this is not tested here (based on the provided text).
Entropy is not the only confound; response length dynamics appear important.
Figure 7 (as described) links entropy collapse with drastic response-length decrease in the β=0.5 setting, suggesting that the failure mode may be about prematurely shortening reasoning rather than “diversity” per se.
However, the exact causal mechanism (why higher-entropy pre-training leads to faster entropy collapse during RL) is observed empirically but not fully derived.
Compute/throughput details are incomplete in the excerpt.
The excerpt provides token counts, batch sizes, learning rates, and sequence lengths, but not hardware, wall-clock time, or compute budget (e.g., PF-days), so reproducibility at the systems level is only partially specified.

7. Implications and Future Directions

How this changes the landscape (within the paper’s scope).
It reframes pre-training not just as “minimize perplexity,” but as “initialize a policy” whose distributional shape can materially affect later on-policy RL.
It argues that “more entropy” is not a reliable proxy for “better exploration”; precision-oriented priors (especially suppressing tail tokens) can yield better RL learning dynamics and final reasoning performance.
Follow-up research directions suggested by the presented framework.
Richer reward design for negatives: Explore combined settings where both \tilde λ and \hat λ are nonzero to jointly preserve plausible alternatives while suppressing tail noise.
Sensitivity studies: Sweep k in TopK(·,k) and expand the grid for β, \tilde λ, \hat λ to map phase transitions (e.g., when entropy collapse occurs).
Task generalization: Apply the same pre-training shaping to RL on domains beyond math (e.g., code, tool use), to test whether the precision-oriented prior generalizes.
Mechanistic diagnostics: Since Figure 7 links entropy and response length, future work could explicitly incorporate length-aware constraints/rewards or analyze which token types (“forking tokens”) are most impacted by the shaping.
Practical applications / downstream use cases.
If you plan to do on-policy RL for verifiable reasoning (unit tests, math proof checking, etc.), this work suggests that shaping the pre-training distribution toward:
- stronger correct-token concentration (β < 0), and/or
- tail-token suppression (\hat λ < 0), can improve RL outcomes even when base perplexity looks similar.
Repro/Integration Guidance (based on provided details).
A minimal integration path is to replace cross-entropy with the single-step policy-gradient objective using \bar r(s_t,a_t):
- Use the same optimizer as the baseline (AdamW, weight decay 0.1, grad clip 1.0) and keep the same LR schedule, so the only change is reward shaping (as done in the experiments).
If your main goal is downstream RL performance and you want to shape negatives only, the stronger of the two negative-shaping configurations in the provided RL tables is the tail penalty (β=0, \hat λ=-0.1, \tilde λ=0, k=100):
- 10B-A0.5B MoE, RL Average Pass@64 at step 1000: baseline 47.52 (Table 15) vs 49.87 with \hat λ=-0.1 (Table 22).
If you prefer a single scalar knob affecting global entropy, β=-0.25 gives the largest 10B improvement in the provided RL tables:
- 10B-A0.5B MoE, RL Average Pass@64 at step 1000: 47.52 (β=0, Table 15) vs 50.75 (β=-0.25, Table 18).