How Far Can Unsupervised RLVR Scale LLM Training?

ArXiv: 2603.08660

Pitch

This work delivers the first comprehensive theoretical framework for Unsupervised RLVR, revealing that intrinsic reward methods inevitably converge toward sharpening the model's initial distribution rather than discovering new knowledge. By demonstrating that these methods universally follow a 'rise-then-fall' pattern where performance gains collapse once confidence diverges from correctness, the authors establish clear boundaries for intrinsic rewards while charting a path toward scalable external verification alternatives.


1. Executive Summary

This paper provides a comprehensive theoretical and empirical analysis of Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR), revealing that all intrinsic reward methods converge toward sharpening the model's initial distribution rather than discovering new knowledge. The core contribution is demonstrating that intrinsic URLVR universally follows a rise-then-fall pattern: early gains reflect alignment between model confidence and correctness, but inevitable collapse occurs when this alignment breaks down. The paper establishes boundaries for intrinsic methods—showing they work safely in test-time training on small datasets (≤128 samples) but cannot scale indefinitely—and proposes Model Collapse Step as a practical indicator of RL trainability that predicts performance gains 5.6× faster than running full RL training while requiring no ground-truth labels.

2. Context and Motivation

The Supervision Bottleneck in LLM Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has driven recent breakthroughs in LLM reasoning capability. Models like DeepSeek-R1, Gemini 2.5, and the Qwen3 series achieved remarkable performance on mathematics, coding, and science benchmarks by scaling supervised RLVR. In RLVR, models learn from rewards verified against ground truth—such as whether a mathematical solution is correct or code executes successfully.

However, this approach faces a fundamental scalability limitation: obtaining ground truth supervision requires human annotation, which becomes prohibitively expensive as models reach or surpass human expertise. The paper frames this as the "supervision bottleneck"—the core problem that motivates the entire work.

The Promise and Uncertainty of Unsupervised RLVR

Unsupervised RLVR (URLVR) promises to extend the RLVR paradigm beyond human-provided labels by deriving rewards without ground truth. This is analogous to how pretraining scaling laws transformed computation into intelligence on vast unlabeled data—the authors aim to extend this to post-training.

Prior works explored intrinsic reward methods:

  • TTRL (Test-Time Reinforcement Learning): Uses majority voting across multiple rollouts to create pseudo-labels
  • EM-RL/RENT: Uses entropy minimization as rewards
  • RLIF: Uses self-certainty (KL divergence from uniform) as rewards
  • RLSC/RLSF: Various probability-based confidence measures

These methods showed promising early training gains, but growing concerns emerged around reward hacking (optimizing proxy rewards at the expense of actual correctness) and model collapse (degeneration to deterministic, often incorrect outputs). The field lacked systematic understanding of when these methods work, why they fail, and whether they can truly scale.

The Gap This Paper Fills

The paper identifies several critical gaps in prior work:

  1. No unified theoretical framework. Various URLVR methods were proposed with different mathematical formulations, but no analysis unified them or explained their common behavior.

  2. Conflicting empirical findings. Some studies reported encouraging gains while others highlighted failure modes, without clear boundary conditions.

  3. Unknown scalability limits. The field lacked a systematic study of how far intrinsic rewards can scale, and of whether collapse is an engineering problem or a fundamental limitation.

  4. No practical indicators. Without understanding failure mechanisms, practitioners lacked tools to predict when RL training would succeed.

How This Paper Positions Itself

The paper positions itself as a comprehensive analysis spanning taxonomy, theory, and extensive experiments. Rather than proposing a single new method, it:

  1. Classifies URLVR methods into intrinsic (certainty-based and ensemble-based) versus external (unlabeled data and generation-verification asymmetry) categories
  2. Develops a theoretical framework revealing that all intrinsic methods share a sharpening mechanism
  3. Provides systematic experiments showing consistent rise-then-fall patterns across methods
  4. Identifies safe application domains (test-time training on small datasets) and fundamental limits
  5. Proposes practical indicators (Model Collapse Step) for predicting RL trainability
  6. Explores alternatives through external reward methods like self-verification

The paper explicitly states that intrinsic rewards are "fundamentally bounded by what the model already knows" and positions external reward methods as the more promising direction for long-term scalability.

3. Technical Approach

3.1 Reader Orientation

This paper builds a theoretical framework for understanding why all intrinsic URLVR methods converge toward deterministic policies, then validates this framework through systematic experiments that reveal rise-then-fall training dynamics. The "system" being analyzed is not a new algorithm but rather a unified lens for understanding existing methods—showing that diverse intrinsic rewards all manipulate cross-entropy to sharpen distributions.

3.2 Big-Picture Architecture (Diagram in Words)

The paper's analytical framework has four major components:

  1. Taxonomy Module — Classifies URLVR methods into intrinsic (certainty-based, ensemble-based) and external (unlabeled data, generation-verification asymmetry) categories based on reward source.

  2. Theoretical Analysis Module — Derives convergence behavior for intrinsic rewards, showing they induce geometric convergence toward deterministic policies concentrated on initially preferred answers.

  3. Empirical Validation Module — Systematically tests five intrinsic methods across multiple models, datasets, and hyperparameters to validate theoretical predictions and identify failure patterns.

  4. Practical Application Module — Identifies safe application domains (test-time training), proposes Model Collapse Step as a trainability indicator, and explores external reward alternatives.

Information flows: taxonomy provides structure → theoretical analysis predicts convergence → empirical validation confirms patterns and identifies boundaries → practical applications translate findings to deployment guidance.

3.3 Roadmap for the Deep Dive

  • First, the paper's taxonomy of URLVR methods, which establishes the classification scheme and introduces mathematical notation for intrinsic rewards.
  • Second, the theoretical sharpening mechanism, which derives why intrinsic rewards converge toward deterministic policies and establishes the core insight about confidence-correctness alignment.
  • Third, the unified reward framework, which shows that all intrinsic methods manipulate cross-entropy through different anchor distributions and aggregations.
  • Fourth, empirical results on training dynamics, which validate theoretical predictions and reveal distinct failure patterns.
  • Fifth, the Model Collapse Step proposal, which leverages the rise-then-fall pattern as a practical indicator.

3.4 Detailed, Sentence-Based Technical Breakdown

This is primarily an empirical analysis paper. Its theoretical contributions unify diverse URLVR methods under a single framework, and its experiments systematically characterize their behavior.


Taxonomy of URLVR Methods

The paper first establishes a classification scheme based on the source of rewards:

Intrinsic reward methods derive rewards solely from the model's internal state or outputs:

  • Certainty-based rewards measure the model's confidence in its predictions, derived from logit distributions
  • Ensemble-based rewards measure agreement across multiple independently sampled outputs

External reward methods derive rewards from sources independent of the model's internal state:

  • Unlabeled data methods leverage structure in text corpora (e.g., next-token prediction as reward)
  • Generation-verification asymmetry methods exploit domains where checking correctness is easier than generating solutions

The mathematical notation established is:

  • \(x\): input prompt
  • \(y = (y_1, \ldots, y_{|y|})\): generated output sequence
  • \(c\): reasoning trajectory extracted from \(y\)
  • \(a\): answer extracted from \(y\) (typically from \boxed{} notation)
  • \(\pi_\theta\): LLM policy parameterized by \(\theta\)
  • \(\pi_\theta(\cdot|x, y_{<t})\): probability distribution over next token \(y_t\)


Certainty-Based Rewards (Table 1)

Certainty-based rewards formalize the model's confidence through different mathematical estimators:

Self-Certainty (RLIF): Defined as the average KL divergence between a uniform distribution \(U\) over the vocabulary and the model's next-token distribution:

\[r_{SC}(x, y) = \frac{1}{|y|} \sum_{t=1}^{|y|} D_{KL}(U \| \pi_\theta(\cdot|x, y_{<t}))\]

This rewards peaked, low-entropy distributions—high KL from uniform indicates the model is confident.

Token-Level Entropy (EM-RL, RENT): Directly penalizes entropy at each generation step:

\[r_H(x, y) = -\frac{1}{|y|} \sum_{t=1}^{|y|} H(\pi_\theta(\cdot|x, y_{<t}))\]

where \(H(\pi) = -\sum_v \pi(v) \log \pi(v)\) is the Shannon entropy.

Trajectory-Level Entropy (EM-RL): Aggregates log-probability across the entire sequence:

\[r_{Traj}(x, y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t|x, y_{<t})\]

This is equivalent to the sequence log-probability normalized by length.

Probability (RLSC): Uses the raw product of token probabilities:

\[r_{Prob}(x, y) = \prod_{t=1}^{|y|} \pi_\theta(y_t|x, y_{<t})\]

The paper notes this rewards brevity because multiplying probabilities naturally favors shorter sequences.

Probability Disparity (RLSF): Measures the gap between top-2 token probabilities:

\[r_{PD}(x, y) = \frac{1}{M} \sum_{t=1}^{|a|} \left[\max_{a_t} \pi_\theta(a_t|x, c, a_{<t}) - \max_{a_t \neq \arg\max \pi_\theta} \pi_\theta(a_t|x, c, a_{<t})\right]\]
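To make the first four estimators concrete, the following is a minimal sketch that evaluates them on toy next-token distributions. The function names, the three-token vocabulary, and the numbers are illustrative assumptions, not taken from the paper; Probability Disparity is omitted, and every distribution is assumed strictly positive so the KL term is finite.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_v p(v) log p(v), skipping zero entries."""
    return -sum(q * math.log(q) for q in p if q > 0)

def kl_uniform_to(p):
    """KL(U || p) for a uniform U over the same vocabulary as p (assumes p > 0)."""
    v = len(p)
    return sum((1.0 / v) * math.log((1.0 / v) / q) for q in p)

def certainty_rewards(dists, tokens):
    """dists[t] is the next-token distribution at step t; tokens[t] is the sampled token id."""
    n = len(tokens)
    token_logps = [math.log(dists[t][tokens[t]]) for t in range(n)]
    return {
        "self_certainty": sum(kl_uniform_to(d) for d in dists) / n,
        "token_entropy": -sum(entropy(d) for d in dists) / n,
        "trajectory": sum(token_logps) / n,
        "probability": math.exp(sum(token_logps)),
    }

# Hypothetical 3-token vocabulary and 2-step outputs (made-up numbers).
peaked = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]
flat = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30]]
r_peaked = certainty_rewards(peaked, [0, 1])
r_flat = certainty_rewards(flat, [0, 1])
print(r_peaked)
print(r_flat)
```

On this toy input, all four rewards are higher for the peaked distributions than for the flat ones, which is the shared preference for confident outputs that the section describes.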

Ensemble-Based Rewards (Table 2)

Ensemble-based rewards operationalize the assumption that consistency across samples correlates with correctness:

Majority Voting (TTRL, SRT, ETTRL, SeRL, SQLM, R-Zero): Samples \(N\) rollouts and uses the most frequent answer as the pseudo-label:

\[r_{MV}(x, y) = \mathbb{1}\left[y = \arg\max_{y'} \sum_{i=1}^{N} \mathbb{1}[y_i = y']\right], \quad \{y_i\}_{i=1}^{N} \sim \pi_\theta(\cdot|x)\]

This provides a binary reward: 1 if the output matches the majority answer, 0 otherwise.
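A minimal sketch of the majority-voting reward follows. It assumes answers have already been extracted from each rollout as strings, and it breaks ties in favor of the answer seen first; both choices are illustrative assumptions rather than details confirmed by the paper.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for the extracted answers of N rollouts."""
    counts = Counter(answers)
    pseudo_label, _ = counts.most_common(1)[0]  # ties: first-encountered answer wins
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Hypothetical extracted answers from N = 8 rollouts of one prompt.
rollout_answers = ["12", "15", "12", "12", "7", "15", "12", "12"]
label, rewards = majority_vote_rewards(rollout_answers)
print(label, rewards)
```

Note that the reward never consults a ground-truth label: a rollout is rewarded purely for agreeing with the most frequent answer.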

Majority Voting Across Rephrased Questions (Co-Reward): Extends majority voting by aggregating across both original and paraphrased versions of the question, improving robustness.

Self-Consistency Weighted Voting (RLCCF): Uses multi-model collectives where \(N\) different models each generate \(K\) solutions, and voting weights reflect internal consistency within each model.

Semantic Similarity (EMPO): Clusters outputs semantically and rewards based on cluster size:

\[r_{EMPO}(x, y) = \frac{|C(y)|}{G}, \quad C(y) \in \text{SemanticCluster}(\{o_i\}_{i=1}^{G})\]

Trajectory Consistency and Volatility (CoVo): Incorporates intermediate reasoning consistency by rewarding trajectories where intermediate steps agree across samples.


The Sharpening Mechanism: Theoretical Analysis

The paper's core theoretical contribution is showing that intrinsic rewards converge toward sharpening the model's initial distribution. Using majority voting as a representative method, the analysis proceeds as follows:

KL-regularized RL objective: The standard formulation is:

\[\max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} [r(x, y)] - \beta D_{KL}[\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)]\]

where \(\pi_{\text{ref}}\) is the reference policy (typically the model before training) and \(\beta\) controls regularization strength.

Closed-form optimal policy: The optimal policy has the well-known form:

\[\pi_\theta^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)\]

where \(Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)\) is the partition function.
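The closed form can be evaluated directly when the output space is small. The sketch below treats the policy as a distribution over three hypothetical outputs; the probabilities, rewards, and the helper name `tilted_policy` are assumptions made for illustration.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta) over a finite set of outputs."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights]

ref = [0.5, 0.3, 0.2]          # hypothetical reference probabilities of three outputs
binary_reward = [1.0, 0.0, 0.0]  # only the first output is rewarded
for beta in (10.0, 1.0, 0.1):
    print(beta, [round(p, 4) for p in tilted_policy(ref, binary_reward, beta)])
```

Smaller \(\beta\) moves more mass onto the rewarded output, while the unrewarded outputs keep their relative proportions from \(\pi_{\text{ref}}\).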

Binary reward structure: For majority voting, the reward at iteration \(k\) is:

\[r_k(x, y) = \mathbb{1}[\text{ans}(y) = \text{maj}_k(Y_k)]\]

where \(Y_k = \{y^{(1)}, \ldots, y^{(N)}\}\) are \(N\) rollouts and \(\text{maj}_k(Y_k)\) is the most frequent answer.

Explicit optimal policy: Due to the binary reward, the exponential term takes only two values (\(e^{1/\beta}\) for majority, \(e^0 = 1\) for others):

\[\pi_\theta^{*,(k+1)}(y|x) = \begin{cases} \frac{\pi_\theta^{(k)}(y|x) \cdot e^{1/\beta}}{Z_k(x)} & \text{if ans}(y) = \text{maj}_k(Y_k) \\ \frac{\pi_\theta^{(k)}(y|x)}{Z_k(x)} & \text{otherwise} \end{cases}\]

where \(Z_k(x) = p_{\text{maj}}^{(k)} \cdot e^{1/\beta} + (1 - p_{\text{maj}}^{(k)})\) and \(p_{\text{maj}}^{(k)}\) is the total probability mass on majority trajectories.

Probability mass update: Under the optimal policy, the probability mass on majority trajectories after one update is:

\[p_{\text{maj}}^{*,(k+1)} = \frac{p_{\text{maj}}^{(k)} \cdot e^{1/\beta}}{p_{\text{maj}}^{(k)} \cdot e^{1/\beta} + (1 - p_{\text{maj}}^{(k)})}\]
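Iterating this map numerically shows how quickly the mass concentrates. The sketch below is an idealization: it applies the optimal update at every step and assumes the majority answer never changes. The starting mass of 0.4 and \(\beta = 2\) are arbitrary illustrative values.

```python
import math

def optimal_update(p_maj, beta):
    """One idealized step: p' = p e^{1/beta} / (p e^{1/beta} + 1 - p)."""
    boost = math.exp(1.0 / beta)
    return p_maj * boost / (p_maj * boost + (1.0 - p_maj))

def iterate(p0, beta, steps):
    """Apply the idealized update repeatedly and return the whole trajectory."""
    traj = [p0]
    for _ in range(steps):
        traj.append(optimal_update(traj[-1], beta))
    return traj

beta = 2.0  # hypothetical regularization strength
traj = iterate(p0=0.4, beta=beta, steps=30)
# Ratio of successive residual masses (1 - p); compare with exp(-1/beta).
ratios = [(1 - traj[k + 1]) / (1 - traj[k]) for k in range(len(traj) - 1)]
print([round(p, 4) for p in traj[:5]])
print(ratios[-1], math.exp(-1.0 / beta))
```

The mass on the majority answer rises monotonically, and the remaining mass shrinks by a factor that approaches \(e^{-1/\beta}\) per step.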

Key ordering result: The actual probability mass after one update satisfies:

\[p_{\text{maj}}^{*,(k+1)} \geq p_{\text{maj}}^{(k+1)} \geq p_{\text{maj}}^{(k)}\]

This ordering holds because: (1) policy gradient with positive rewards increases probability of rewarded trajectories, and (2) single-step updates cannot exceed the optimal policy.

Theorem 1 (Geometric Convergence): Under assumptions of (A1) majority stability and (A2) effective learning, the probability mass \(p_{\text{maj}}^{(k)}\) converges geometrically to 1 with rate \(\rho = e^{-1/\beta}\), and the policy converges to:

\[\lim_{k \to \infty} \pi_\theta^{(k)}(y|x) = \begin{cases} \frac{\pi_{\text{ref}}(y|x)}{\sum_{y': \text{ans}(y')=\text{maj}_0(Y_0)} \pi_{\text{ref}}(y'|x)} & \text{if ans}(y) = \text{maj}_0(Y_0) \\ 0 & \text{otherwise} \end{cases}\]
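A short worked step, for the idealized optimal update only, suggests where the rate comes from. Rewriting the probability mass update in terms of odds and of the residual mass gives:

\[\frac{p_{\text{maj}}^{*,(k+1)}}{1 - p_{\text{maj}}^{*,(k+1)}} = e^{1/\beta} \cdot \frac{p_{\text{maj}}^{(k)}}{1 - p_{\text{maj}}^{(k)}}, \qquad \frac{1 - p_{\text{maj}}^{*,(k+1)}}{1 - p_{\text{maj}}^{(k)}} = \frac{1}{Z_k(x)} \longrightarrow e^{-1/\beta} \quad \text{as } p_{\text{maj}}^{(k)} \to 1\]

The odds of the majority answer grow by a factor of \(e^{1/\beta}\) per step, and since \(Z_k(x) \to e^{1/\beta}\) as the mass concentrates, the residual mass contracts by \(e^{-1/\beta}\). This is a sketch under the assumption that each step attains the optimal policy; the theorem's assumptions (A1) and (A2) are what connect it to the actual updates.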

Interpretation: This convergence has profound implications:

  • When the initial majority aligns with correctness, convergence reinforces good solutions
  • When misaligned, the same mechanism systematically amplifies errors, leading to model collapse


Unified Reward Framework

The paper proposes a unified framework showing that all intrinsic rewards manipulate cross-entropy:

\[r_{\text{uni}}(x, y) = \psi\left[\sigma \sum_{i \in I} \mathbb{H}(q_i, \pi_\theta^i)\right], \quad \sigma \in \{+1, -1\}\]

where:

  • \(\mathbb{H}(q_i, \pi_\theta^i)\): Cross-entropy between anchor distribution \(q_i\) and model distribution \(\pi_\theta^i\)
  • \(I\): Aggregation granularity (token-level or answer-level)
  • \(\sigma\): Sign factor determining optimization direction
  • \(\psi\): Monotonic transformation (identity or exponential)

Key components explained:

Aggregation granularity \(I\): For token-level methods, \(I = \{1, \ldots, |y|\}\) where each element is a position in the sequence. For answer-level methods, \(I = \{A\}\) represents a single distribution over complete semantic answers.

Anchor distribution \(q_i\): Serves as a reference point. Different methods use different anchors:

  • Uniform distribution \(U_V\) (Self-Certainty): rewards departure from randomness
  • One-hot distribution \(\delta_t\) centered on generated token (Trajectory-Level Entropy): reinforces high-probability paths

Sign factor \(\sigma\): When \(\sigma = +1\), rewards increase with cross-entropy (departure from uniform). When \(\sigma = -1\), rewards decrease with cross-entropy (alignment with sharp anchors).
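As a worked check, two of the earlier rewards can be recovered from this template using standard identities. The sketch assumes that the \(1/|y|\) normalization and additive constants are absorbed into \(\psi\), which the summary does not state explicitly.

With a one-hot anchor \(\delta_t\) on the generated token, the cross-entropy reduces to a negative log-probability:

\[\mathbb{H}(\delta_t, \pi_\theta^t) = -\sum_{v} \delta_t(v) \log \pi_\theta(v|x, y_{<t}) = -\log \pi_\theta(y_t|x, y_{<t})\]

so \(\sigma = -1\) with \(I = \{1, \ldots, |y|\}\) gives \(\sum_{t} \log \pi_\theta(y_t|x, y_{<t}) = \log \pi_\theta(y|x)\). Taking \(\psi = \exp\) yields \(r_{Prob}\), and taking the identity with length normalization yields \(r_{Traj}\).

With a uniform anchor \(U\) over a vocabulary of size \(|V|\):

\[D_{KL}(U \| \pi_\theta^t) = \mathbb{H}(U, \pi_\theta^t) - H(U) = \mathbb{H}(U, \pi_\theta^t) - \log |V|\]

so \(\sigma = +1\) recovers \(r_{SC}\) up to the constant \(\log |V|\) and the same normalization.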

Reward-Confidence Monotonicity (Proposition 1): For methods with \(\sigma = -1\):

\[\pi_\theta(y_a|x) > \pi_\theta(y_b|x) \Longrightarrow r_{\text{uni}}(x, y_a) > r_{\text{uni}}(x, y_b)\]

This creates a self-reinforcing feedback loop: higher probability trajectories receive higher rewards, driving geometric concentration toward deterministic outputs.


Empirical Setup

Training framework: All experiments use the veRL framework with the GRPO algorithm. Default hyperparameters (Table 7):

  • Training temperature: 1.0
  • Global batch size: 64
  • Mini-batch size: 64
  • Rollout number: 8
  • No KL/entropy regularization
  • Max prompt length: 1024
  • Max response length: 7168
  • Learning rate: \(10^{-6}\)
  • Epochs: 1

Evaluation benchmarks:

  • AIME 2024, AIME 2025: High-school competition math
  • AMC 2023: American Mathematics Competition
  • MATH-500: Held-out test set for fine-grained analysis

Metrics tracked:

  • Label Accuracy: Whether pseudo-label matches ground truth
  • Reward Accuracy: Sample-level agreement between pseudo-rewards and oracle rewards
  • Ground Truth Reward: Average oracle reward (supervised baseline)
  • Majority Voting Reward: Average pseudo-reward from majority voting
  • Actor Entropy: Entropy of the model's output distribution
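The first four metrics can be sketched for a single prompt as follows. This assumes majority voting as the pseudo-reward and exact string matching on extracted answers; the rollout data and the function name are made up for illustration, and the paper's aggregation across prompts may differ.

```python
from collections import Counter

def label_and_reward_accuracy(answers, ground_truth):
    """Compare majority-vote pseudo-rewards with oracle rewards for one prompt."""
    pseudo_label = Counter(answers).most_common(1)[0][0]
    pseudo = [a == pseudo_label for a in answers]   # majority-voting rewards
    oracle = [a == ground_truth for a in answers]   # ground-truth rewards
    n = len(answers)
    return {
        "label_accuracy": float(pseudo_label == ground_truth),
        "reward_accuracy": sum(p == o for p, o in zip(pseudo, oracle)) / n,
        "majority_voting_reward": sum(pseudo) / n,
        "ground_truth_reward": sum(oracle) / n,
    }

answers = ["12", "15", "12", "12", "7", "15", "12", "12"]  # hypothetical rollouts
aligned = label_and_reward_accuracy(answers, ground_truth="12")
misaligned = label_and_reward_accuracy(answers, ground_truth="15")
print(aligned)
print(misaligned)
```

In the misaligned case the Majority Voting Reward is unchanged at 0.625 while Reward Accuracy falls to 0.125, which illustrates how a proxy reward can look healthy while diverging from correctness.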

Default evaluation: Generate 32 solutions per problem at temperature 0.6 with top-p 0.95, report average accuracy (avg@32).


The Rise and Fall of Intrinsic URLVR

Main finding (Figure 2): Intrinsic rewards initially match or exceed ground-truth training performance, but eventually collapse. The divergence manifests as:

  • Majority Voting Reward keeps rising (proxy reward optimization)
  • Reward Accuracy declines (pseudo-rewards diverge from correctness)
  • Validation performance declines (actual quality degrades)

This pattern exists across all hyperparameter settings tested, with the paper noting that collapse occurs around 1000 steps even with the most stable configurations.


Different Methods, Different Failures (Figure 3)

The paper reveals three distinct failure patterns:

Gradual degradation (Self-Certainty, Majority Voting): These methods degrade most slowly, maintaining higher validation performance and Label Accuracy. Self-Certainty is less aggressive because it sharpens against a uniform distribution rather than directly maximizing probability. Majority Voting operates at answer level, avoiding token-level artifacts.

Length collapse (Probability): The probability reward multiplies token probabilities, naturally favoring shorter sequences. The model becomes confident (lower Actor Entropy) but produces overly brief answers (shorter Mean Response Length).

Repetition collapse (Token-Level Entropy, Trajectory-Level Entropy): Averaging entropy across tokens encourages the model to pad sequences with repetitive high-probability tokens, since this minimizes per-position entropy.
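The length bias behind the Probability reward's collapse is easy to reproduce with toy numbers. The sketch below compares two hypothetical outputs that have identical per-token confidence but different lengths; the value 0.9 and the lengths are arbitrary.

```python
import math

def probability_reward(token_probs):
    """Raw product of token probabilities (the Probability reward)."""
    return math.prod(token_probs)

def trajectory_reward(token_probs):
    """Length-normalized sequence log-probability (the Trajectory-Level reward)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

short = [0.9] * 10    # hypothetical 10-token answer
long = [0.9] * 100    # hypothetical 100-token answer, same per-token confidence

print(probability_reward(short), probability_reward(long))
print(trajectory_reward(short), trajectory_reward(long))
```

The product drops from about 0.35 to about 0.00003 purely because of length, whereas the length-normalized reward is the same for both outputs.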


Per-Problem Analysis (Figure 4, Figure 15)

Training on 25 individual MATH-500 problems reveals four patterns:

Amplifying success: Problems where greedy decoding starts correct remain correct; training increases confidence around the already-correct solution.

Amplifying failure: Problems where highest-reward samples are wrong throughout; training amplifies confidence in incorrect answers.

Wrong → Correct: Rare cases (12% of problems) where highest-reward samples guide the model from wrong to correct.

Correct → Wrong: Cases where inconsistent reward signals cause gradual degradation.

Key finding: Training flips greedy correctness in only 3 of 25 cases (12%). The remaining 22 simply sharpened initial preferences regardless of correctness, revealing training as amplification rather than correction.


Out-of-Distribution Generalization (Figure 5)

Surprisingly, even when training amplifies errors on in-distribution problems, it can still correct OOD problems. Training on 6 problems with wrong majority votes showed Test Label Accuracy on unseen problems increasing from 0 to 1, even though Training Label Accuracy remained near 0.

This indicates that sharpening generalizes: as long as confidence aligns with correctness on unseen problems, the amplification mechanism works even if it fails on training data.


Small Datasets Prevent Collapse (Figure 6, Figure 7)

Training with ≤128 samples maintains stable performance without collapse, while ≥512 samples consistently exhibits reward hacking. The paper hypothesizes that small datasets induce localized overfitting rather than systematic policy shift.

KL divergence measurements show DAPO-32 reaches only 0.057 KL after 600 steps, while DAPO-512 reaches roughly twice that (~0.12). This limited drift suggests the model sharpens confidence through localized parameter updates without altering the global policy.


Test-Time Training as Safe Application (Figure 8)

Training on AMC23 (40 problems, test-time) avoids collapse, with both Ground Truth Reward and Majority Voting Reward rising and stabilizing. Performance improves on both AMC23 and AIME24. This contrasts with training on DAPO-17k (~17,000 problems), which shows the familiar rise-then-fall pattern.


Incorrect Majority Votes Still Improve Reasoning (Figure 9)

Even when almost all 32 training samples have incorrect initial majority votes, training still produces effective learning without catastrophic collapse. Gains on AIME24 and AMC23 demonstrate that small-scale training operates under fundamentally different dynamics than large-scale training.


Model Collapse Step as Trainability Indicator

The paper proposes Model Collapse Step as a practical indicator of model priors. It is formally defined as the training step at which Reward Accuracy drops below 1%, so it measures how long a model can sustain intrinsic URLVR before collapse.
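Given a logged Reward Accuracy curve, the definition reduces to a first-crossing search. The sketch below assumes one Reward Accuracy value per training step and 1-indexed steps; the curve itself is made up, and the function name is hypothetical.

```python
def model_collapse_step(reward_accuracy_by_step, threshold=0.01):
    """Return the first 1-indexed step whose Reward Accuracy is below threshold, or None."""
    for step, acc in enumerate(reward_accuracy_by_step, start=1):
        if acc < threshold:
            return step
    return None  # no collapse observed within the logged steps

curve = [0.62, 0.60, 0.41, 0.12, 0.03, 0.008, 0.002]  # made-up Reward Accuracy log
print(model_collapse_step(curve))
```

A later collapse step means the model sustained training for longer before its pseudo-rewards stopped agreeing with correctness.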

Comparison (Figure 11): Model Collapse Step correlates strongly with Ground Truth Gain (performance improvement from supervised RLVR), matching or surpassing pass@k's predictive power. Unlike pass@k, it cannot be gamed by random guessing on multiple-choice questions.

Efficiency (Table 3): Model Collapse Step requires 5.6× fewer tokens than full RL training while preserving relative ranking of models:

  • GT Gain: 6.66B tokens
  • Model Collapse Step: 1.19B tokens
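The 5.6× figure follows directly from the two token counts:

\[\frac{6.66\text{B}}{1.19\text{B}} \approx 5.60, \qquad 1 - \frac{1.19\text{B}}{6.66\text{B}} \approx 0.82\]

In other words, the indicator uses roughly 18% of the tokens that full RL training consumes, a saving of about 82%.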

Hyperparameter acceleration (Figure 12): Aggressive hyperparameters (mini-batch size = 1, rollout count = 8) accelerate collapse but preserve relative model rankings, enabling rapid assessment.


Self-Verification as External Reward (Figure 13, Figure 14)

The paper explores self-verification as an example of external rewards that exploit generation-verification asymmetry. On the Countdown task (forming arithmetic expressions to reach a target value):

  • Generating correct expressions is challenging
  • Verifying whether an expression equals the target is trivial
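To illustrate how cheap verification is on this task, here is a minimal rule-based checker. It is not the paper's self-verification setup, where the model itself judges the output via a prompt; it only shows the asymmetry. The rule that every given number must be used exactly once, and the restriction to `+`, `-`, `*`, `/` on non-negative integers, are assumptions about the task format.

```python
import ast
import operator
from collections import Counter

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node, used):
    """Evaluate a parsed arithmetic expression, recording every number it uses."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, used), _eval(node.right, used))
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    raise ValueError("unsupported expression")  # names, calls, unary ops, floats

def verify_countdown(expression, numbers, target):
    """True if the expression uses exactly the given numbers and evaluates to target."""
    used = []
    try:
        value = _eval(ast.parse(expression, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and abs(value - target) < 1e-9

print(verify_countdown("(25 - 5) * 4", [4, 5, 25], 80))   # valid expression
print(verify_countdown("25 * 4", [4, 5, 25], 100))        # leaves 5 unused
```

Checking a candidate takes a single parse and evaluation, whereas producing a valid expression requires searching over operator and ordering choices.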

Results: Self-Verification achieves higher validation accuracy than Trajectory-Level Entropy. Reward Accuracy initially drops around step 200 as the policy tries to exploit the verifier, then recovers and stabilizes above 0.5, while Ground Truth Reward keeps rising. This recovery shows genuine learning rather than reward hacking.

Prompt sensitivity: Instruction-aligned models succeed with both verification prompts, while base models are highly sensitive. This suggests instruction-following capability is key to self-verification success.


4. Key Insights and Innovations

Innovation 1: Unified Theoretical Framework Revealing the Sharpening Mechanism

The paper's most fundamental contribution is establishing that all intrinsic URLVR methods share a common mechanism: they converge toward sharpening the model's initial distribution. This is achieved by proving:

  1. Theorem 1: Under KL-regularized optimization with binary rewards (majority voting), probability mass on the initial majority answer converges geometrically to 1.

  2. Unified Reward Framework: All intrinsic rewards can be expressed as manipulating cross-entropy between anchor distributions and model distributions, with sign factor \(\sigma\) determining whether the method rewards departure from uniformity (\(\sigma = +1\)) or alignment with sharp anchors (\(\sigma = -1\)).

  3. Proposition 1: Methods with \(\sigma = -1\) satisfy Reward-Confidence Monotonicity—higher probability trajectories receive higher rewards—creating a self-reinforcing feedback loop.

Why this matters: Prior work treated different intrinsic methods as unrelated approaches. This framework unifies them, explaining why diverse formulations (entropy minimization, majority voting, probability maximization) all exhibit similar failure patterns. The mechanism is structural, not incidental.

What makes it novel: The paper provides the first theoretical proof that intrinsic rewards converge to deterministic policies concentrated on initially preferred answers. This transforms the field's understanding from "intrinsic methods sometimes fail due to hyperparameters" to "intrinsic methods fundamentally cannot discover new knowledge."

Innovation 2: Confidence-Correctness Alignment as the Determining Factor

The paper establishes that the success of intrinsic URLVR depends on whether the model's initial confidence aligns with correctness. This is a subtle but profound insight:

  • When alignment exists (high-confidence answers are correct), sharpening acts as a beneficial amplifier
  • When alignment breaks down (high-confidence answers are wrong), the same mechanism reinforces errors

Empirical validation: Per-problem analysis shows that in 88% of cases (22/25), training simply sharpened initial preferences. The direction of change was determined by whether the highest-reward sample was correct, not by whether training "corrected" errors.

OOD generalization insight: Even when training amplifies errors on in-distribution problems, it can still benefit OOD problems—as long as confidence aligns with correctness on those unseen problems.

Why this matters: This resolves conflicting prior findings. Studies that found intrinsic methods work were testing on distributions where confidence aligned with correctness; studies that found failure were testing where alignment was weak.

Innovation 3: Systematic Characterization of Failure Patterns

The paper identifies three distinct failure modes for intrinsic methods:

  1. Gradual degradation (Self-Certainty, Majority Voting): Slow decline maintaining reasonable metrics
  2. Length collapse (Probability): Optimizing probability product rewards brevity, producing overly short answers
  3. Repetition collapse (Entropy-based methods): Minimizing average entropy encourages repetitive high-probability tokens

Why this matters: These patterns reveal that different mathematical formulations of "confidence" lead to qualitatively different failures. Probability collapse is about length; entropy collapse is about repetition; voting collapse is about consensus optimization.

Design implication: The paper shows that collapse timing is "determined by model prior rather than engineering choices." Hyperparameter tuning can delay but not prevent collapse—the fundamental limitation is the mechanism itself.

Innovation 4: Model Collapse Step as Practical Trainability Indicator

The paper proposes Model Collapse Step as a novel metric for predicting RL trainability:

  • Defined as the training step where Reward Accuracy drops below 1%
  • Requires no ground-truth labels (unlike standard RL evaluation)
  • Correlates strongly with Ground Truth Gain from supervised RLVR
  • Computed 5.6× faster than running full RL training
  • Robust to hyperparameter changes (preserves relative model rankings)

Why this matters: Practitioners evaluating multiple model candidates for RL currently must run expensive training runs. Model Collapse Step provides a diagnostic that captures the interaction between model priors and the learning process, enabling efficient pre-RLVR assessment.

What makes it novel: Prior metrics like pass@k measure static performance. Model Collapse Step measures training dynamics, capturing whether sharpening amplifies correct or incorrect predictions before collapse occurs.

Innovation 5: Safe Application Boundaries for Intrinsic URLVR

The paper establishes clear boundaries for safe application:

  • Safe: Test-time training on small datasets (≤128 samples), even with incorrect majority votes
  • Unsafe: Large-scale training (≥512 samples), where collapse is inevitable

The mechanism is localized overfitting: small datasets cause parameter updates that sharpen confidence on specific samples without altering global policy (KL divergence remains low). Large datasets require dense updates that shift global policy.

Why this matters: This provides actionable guidance. Intrinsic URLVR is not useless—it has legitimate applications in test-time adaptation scenarios where the small-data constraint is natural.

5. Experimental Analysis

Evaluation Methodology

Models tested:

  • Primary: Qwen3-1.7B-Base
  • Additional: Qwen2.5-1.5B, Qwen2.5-7B, Qwen3-4B-Base, Qwen3-8B-Base, DeepSeek-R1-Distill-Qwen-1.5B, Llama-3.1-8B, Llama-3.1-Tulu-3-8B-SFT, OLMo-2-1124-7B

Training datasets:

  • DAPO-17k: ~17,000 math problems (primary)
  • MATH-8k, DeepScaleR-40k, ORZ-56k: Additional comparisons
  • Single problems from MATH-500: Per-problem analysis
  • Countdown: 4k training, 1k validation for self-verification experiments

Evaluation benchmarks:

  • AIME 2024: 30 problems
  • AIME 2025: 30 problems
  • AMC 2023: 40 problems

Training framework: veRL with GRPO algorithm

Hyperparameters swept:

  • Training temperature: {0.6, 0.8, 1.0, 1.2}
  • Mini-batch size: {1, 8, 16, 32, 64}
  • Rollout count: {4, 8, 16, 32}
  • KL regularization: with/without (\(\beta\) = 0.005)

Intrinsic methods tested:

  • Ensemble-based: Majority Voting
  • Certainty-based: Self-Certainty, Token-Level Entropy, Trajectory-Level Entropy, Probability

Main Quantitative Results

Rise and Fall Pattern (Figure 2)

Training Qwen3-1.7B-Base on DAPO-17k with majority voting reward:

| Metric | Early Phase (Step ~50) | Late Phase (Step ~200) |
|---|---|---|
| Majority Voting Reward | ~0.4 | ~0.9 |
| Reward Accuracy | ~0.6 | ~0.2 |
| AIME 2024 Accuracy | ~0.05 | ~0.02 |
| AIME 2025 Accuracy | ~0.03 | ~0.01 |
| AMC 2023 Accuracy | ~0.30 | ~0.25 |

The proxy reward rises while validation accuracy falls, demonstrating divergence between optimizing confidence and correctness.

Different Methods, Different Failures (Figure 3)

Performance on AIME 2024 at step 200:

  • Self-Certainty: ~0.055 accuracy (most stable)
  • Majority Voting: ~0.04 accuracy
  • Trajectory-Level Entropy: ~0.035 accuracy
  • Token-Level Entropy: ~0.025 accuracy
  • Probability: ~0.015 accuracy (worst, due to length collapse)

Mean Response Length shows Probability collapsing to ~1500 tokens while others maintain 3000-6000 tokens.

Dataset Size Impact (Figure 6)

| Dataset Size | Ground Truth Reward (Step 600) | Majority Voting Reward | Collapse? |
|---|---|---|---|
| 32 | Stable ~0.45 | Reaches 1.0 | No |
| 128 | Stable ~0.35 | Reaches ~0.9 | No |
| 512 | Declining ~0.2 | Rising | Yes |
| 2048+ | Near zero | Near 1.0 | Yes |

KL divergence (Figure 7): DAPO-32 reaches 0.057 KL, DAPO-512 reaches ~0.12 KL after 600 steps.

Model Collapse Step Prediction (Figure 11, Table 3)

Correlation with GT Gain:

  • Model Collapse Step: Strong positive correlation
  • Pass@k Gain: Weaker correlation, can be gamed by multiple-choice random guessing

Computation cost comparison:

  • GT Gain (full RL): 6.66B tokens total
  • Model Collapse Step: 1.19B tokens total (5.6× faster)

Collapse steps for 7 models with aggressive hyperparameters (MBS=1, N=8): [22, 14, 19, 112, 128, 172, 195]

Backbone Model Analysis (Figures 37-40)

Qwen family stability: SFT variants (DeepSeek-R1-Distill-Qwen-1.5B) maintain Reward Accuracy > 0.8 throughout training, while base models drop to near zero by step 200.

Llama family: All variants eventually collapse, with timing: Base (step 40) < Math Base < SFT < Instruct (latest collapse).

Counterintuitive scaling: Smaller models consistently remain stable longer than their larger variants. Q3-1.7B maintains stability longer than Q3-4B, and Octo-3B outlasts Octo-8B by ~40 steps.

Architectural generation: Qwen3 models exhibit superior stability compared to Qwen2.5 counterparts.

Self-Verification Results (Figure 13)

Countdown task validation accuracy:

  • Self-Verification (Q3-1.7B): ~0.65 final
  • Trajectory-Level Entropy (Q3-1.7B): ~0.35 final
  • Oracle Supervision (Q3-1.7B): ~0.75 final

Self-Verification shows recovery in Reward Accuracy after initial drop, while Ground Truth Reward keeps rising—indicating genuine learning.

TTRL Capability Improvement (Table 9)

Comparison on AIME 2024:

  • Qwen2.5-Math-1.5B base maj@1024: 37.30%
  • Qwen2.5-Math-1.5B w/ TTRL avg@32: 48.90% (+11.60%)
  • Qwen2.5-Math-7B base maj@1024: 50.79%
  • Qwen2.5-Math-7B w/ TTRL avg@32: 68.10% (+17.31%)

TTRL-trained models significantly exceed the base model's majority-vote performance upper bound, demonstrating genuine capability improvement beyond consistency alignment.

Ablation Studies and Robustness Checks

Hyperparameter sensitivity (Figures 16-35):

Temperature: Lower temperature (0.6-0.8) accelerates collapse; higher (1.2) delays collapse but increases noise; optimal at 1.0.

Mini-batch size: Size 1 causes collapse within 20 steps; size 64 (pure on-policy) provides maximum stability.

KL regularization: Adding KL (\(\beta\) = 0.005) provides only marginal benefits, increasing variance without preventing collapse.

Rollout count: Larger counts (N ≥ 16) accelerate collapse; N ≤ 8 maintains stability over full epoch.

Cross-validation for Model Collapse Step: Rankings preserved across different hyperparameter settings, enabling aggressive settings for faster assessment.

Seed verification: Results verified across 3 random seeds for subset sizes {32, 128, 512}. DAPO-32 never collapses; DAPO-512 always collapses.

Assessment: Do the Experiments Support the Claims?

Claim 1: Intrinsic rewards converge toward sharpening initial distribution. Strongly supported by theoretical derivation (Theorem 1) and empirical validation. Tables 4-5 show monotonic increase of \(p_{\text{maj}}\) toward near-complete concentration (98.54%-99.80%) over 50 training steps. Per-problem analysis confirms amplification of initial preferences in 88% of cases.

Claim 2: Rise-then-fall pattern is universal across methods. Supported by Figure 3 showing five different methods all eventually degrade. The paper notes that "collapse timing is determined by model prior rather than engineering choices" — hyperparameter tuning affects when but not whether collapse occurs.

Claim 3: Small datasets prevent collapse. Strongly supported by Figure 6 showing stable performance for ≤128 samples versus collapse for ≥512 samples. KL divergence analysis (Figure 7) provides mechanistic explanation. Robustness check with incorrect majority votes (Figure 9) shows gains even when training amplifies errors.

Claim 4: Model Collapse Step predicts RL trainability. Supported by Figure 11 showing strong correlation with GT Gain. Efficiency claim of 5.6× is validated in Table 3. The paper correctly notes this enables efficient pre-RLVR assessment without ground-truth labels.

Claim 5: External rewards (self-verification) escape intrinsic limits. Preliminarily supported by Figure 13 showing Self-Verification outperforming Trajectory-Level Entropy and exhibiting recovery dynamics rather than collapse. However, the paper appropriately frames this as "preliminary evidence" and notes dependency on instruction-following capability.

Potential weaknesses:

  1. Single primary model: Most experiments use Qwen3-1.7B-Base; while additional models are tested, the depth of analysis for secondary models is limited.

  2. Math domain focus: All primary experiments are on mathematical reasoning; generalization to other domains (code, general reasoning) is not systematically tested.

  3. Hyperparameter tuning depth: Each method uses separately tuned hyperparameters without systematic comparison of optimal configurations across methods.

  4. Self-verification prompt sensitivity: Figure 14 shows base models are highly sensitive to prompt choice while instruction models are robust; this limitation is noted but not deeply explored.

  5. Compute cost for difficulty estimation not included: The reported 1.19B-token budget covers the collapse-step runs themselves, but the paper does not account for pre-computation needs such as selecting the aggressive hyperparameter settings used for those runs.

6. Limitations and Trade-offs

Assumption: Majority Stability Holds Throughout Training

The theoretical analysis in Theorem 1 relies on assumption (A1): that the majority answer remains stable across iterations (\(\text{maj}_k(Y_k) = \text{maj}_0(Y_0)\) for all \(k\)). The paper validates this empirically with \(N = 1024\) rollouts in Appendix A.1.1, showing the majority never flipped across 200 iterations. However, this validation is limited to a small set of problems and may not generalize to:

  • Low-confidence regimes: When initial \(p_{\text{maj}}^{(0)}\) is close to 0.5 (competitive answers), the majority could flip with policy updates
  • High-entropy models: Models with less calibrated uncertainty estimates may exhibit unstable voting patterns
  • Small rollout counts: The paper notes that "even moderate \(N\) (we use \(N = 8\)) provides reasonable stability," but provides no quantitative threshold for when stability breaks down

The assumption is reasonable for well-calibrated models on problems with clear majority answers, but practitioners should verify stability in their specific setting.

Assumption: Non-Trivial Progress at Each Step

Assumption (A2) requires that each gradient update makes "non-trivial progress" (\(\eta_k \geq \eta_{\text{min}} > 0\)). The paper validates this empirically through monotonic increase of \(p_{\text{maj}}\) in Tables 4-5, but this is inherently circular—the validation uses the same training dynamics the theorem predicts.

The paper notes: "We assume that policy gradient updates with positive learning rate satisfy: if \(r(y^*) > r(y')\) and both have positive probability, then the updated policy satisfies \(\frac{\pi_{k+1}(y^*)}{\pi_{k+1}(y')} \geq \frac{\pi_k(y^*)}{\pi_k(y')}\)." This is presented as a "standard assumption in policy gradient methods," but for practitioners, this means:

  • Learning rate matters: Very small learning rates may violate effective progress
  • Architecture matters: Some model architectures may have flat loss landscapes that impede progress
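A toy simulation can illustrate the ratio condition. The sketch below is not the paper's setup: it assumes a tabular softmax policy over a handful of candidate answers, a binary majority-voting reward, and exact expected policy-gradient updates (so sampling noise is absent). For a softmax policy the gradient of the expected reward with respect to logit \(i\) is \(\pi_i (r_i - \bar{r})\), where \(\bar{r}\) is the expected reward. The function names, learning rate, and starting logits are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sharpen(logits, lr=1.0, steps=50):
    """Run exact policy-gradient ascent under a majority-voting reward.

    Returns the probability of the current majority answer before each update.
    """
    logits = list(logits)
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        maj = max(range(len(probs)), key=probs.__getitem__)
        # Binary proxy reward: 1 for the majority answer, 0 otherwise.
        rewards = [1.0 if i == maj else 0.0 for i in range(len(probs))]
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Expected softmax policy gradient: pi_i * (r_i - baseline).
        logits = [t + lr * p * (r - baseline)
                  for t, p, r in zip(logits, probs, rewards)]
        history.append(probs[maj])
    return history


# Three candidate answers with a mild initial preference for the first one.
p_maj = sharpen([0.5, 0.3, 0.0], lr=1.0, steps=200)
```

In this idealized setting the majority logit only ever increases and every other logit only ever decreases, so `p_maj` rises monotonically toward 1. Shrinking `lr` slows that rise, which is one way to see why very small learning rates strain the "non-trivial progress" requirement in practice.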

Single Domain Focus: Mathematical Reasoning

All primary experiments are conducted on mathematical reasoning benchmarks (AIME, AMC, MATH). The paper acknowledges this focus is deliberate because "test-time compute is expected to help most when the model already possesses the necessary knowledge and the challenge is drawing complex inferences—mathematical reasoning fits this profile." However, this creates limitations:

  • Code generation: Not tested, despite being a natural domain for URLVR (execution provides verification)
  • Open-ended reasoning: Tasks without clear answer formats (e.g., essay writing, creative generation) lack suitable voting mechanisms
  • Factual QA: Tasks requiring knowledge retrieval rather than inference may exhibit different dynamics

The paper's extension to Countdown (Section 7) begins to address this, but the systematic breadth of the math experiments is not replicated across domains.

Model Coverage: Primarily Qwen3-1.7B-Base

While Section 6.1 and Appendix C.2 test 11 models across Qwen and Llama families, the deep analysis (hyperparameter sweeps, per-problem tracking, failure mode analysis) is conducted primarily on Qwen3-1.7B-Base. The paper argues this model is "representative," but:

  • Scale effects: The counterintuitive finding that "smaller models consistently outperform larger variants" (Figure 39) suggests scale matters in non-obvious ways
  • Architecture effects: Qwen family shows fundamentally different stability patterns than Llama (SFT enables stability in Qwen, both eventually collapse in Llama)
  • Training stage effects: The paper shows math-specialized and SFT models demonstrate superior stability, but the mechanism for why remains unclear

Practitioners working with non-Qwen models or larger scales should validate findings in their specific context.

Model Collapse Step: Efficiency Versus Applicability

The Model Collapse Step metric is proposed as a 5.6× faster alternative to running full RL training for assessing trainability. However, several limitations constrain its applicability:

Requires running intrinsic URLVR: While faster than full RL, computing Model Collapse Step still requires:

  • Training setup with intrinsic rewards
  • Sufficient compute budget to reach collapse (14-195 steps across the 7 models in Figure 12)
  • Monitoring Reward Accuracy at each step

Binary threshold choice: The definition uses "Reward Accuracy drops below 1%" as the threshold. The paper provides no sensitivity analysis for this choice—why 1% rather than 5% or 0.1%? Different thresholds could yield different rankings.
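The threshold sensitivity is easy to probe once a Reward Accuracy trace is logged. The following is a minimal sketch, assuming the collapse step is the first logged step (1-indexed) at which Reward Accuracy falls below the threshold; the function name and the trace values are hypothetical, not data from the paper.

```python
from typing import Optional, Sequence


def model_collapse_step(reward_accuracy: Sequence[float],
                        threshold: float = 0.01) -> Optional[int]:
    """Return the first 1-indexed step whose Reward Accuracy is below `threshold`.

    Returns None when the trace never crosses the threshold (no collapse observed).
    """
    for step, acc in enumerate(reward_accuracy, start=1):
        if acc < threshold:
            return step
    return None


# Hypothetical per-step Reward Accuracy trace (illustrative values only).
trace = [0.62, 0.55, 0.31, 0.08, 0.03, 0.004, 0.0]

# Re-run the same trace under different thresholds to see how the step moves.
steps_by_threshold = {t: model_collapse_step(trace, t) for t in (0.05, 0.01, 0.001)}
```

On this toy trace the collapse step moves from 5 to 6 to 7 as the threshold tightens from 5% to 1% to 0.1%. Running the same sweep over real traces for several models would show whether the model ranking is stable under the threshold choice.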

Multiple-choice gaming resistance is partial: The paper correctly notes pass@k can be "gamed by random guessing on multiple-choice questions," but Model Collapse Step's resistance is not thoroughly tested. For problems with small answer spaces, similar gaming could occur.

Not validated for external reward methods: Model Collapse Step measures collapse under intrinsic URLVR. Its predictive power for external reward methods (self-verification, unlabeled data methods) is unknown.

Self-Verification Dependency on Instruction Alignment

Figure 14 reveals that instruction-aligned models succeed with both verification prompts (P1 and P2), while base models only work with P2. This "prompt sensitivity" raises practical concerns:

  • Requires instruction-tuned models: The benefits of self-verification may not transfer to pure base models without careful prompt engineering
  • Prompt engineering burden: Practitioners need to develop and validate verification prompts for their specific domain
  • Robustness unclear: The paper tests only two prompts; whether these generalize across tasks is unknown

The paper frames this as demonstrating that "instruction-following capability is key to self-verification success," but this also limits applicability to models with strong instruction-following abilities.

Hyperparameter Interactions Not Fully Explored

The hyperparameter sweeps in Appendix B.3 vary one parameter at a time while holding others fixed. This reveals:

  • Temperature effects depend on method (Self-Certainty prefers 1.0, others prefer 1.2)
  • Mini-batch size effects are consistent across methods
  • KL regularization provides "marginal benefits" across methods

However, the interactions between parameters are not systematically explored. For example:

  • Does the optimal temperature change with different mini-batch sizes?
  • Do entropy-based methods benefit from KL regularization more than probability-based methods?
  • How do rollout count and temperature interact?

Practitioners tuning for specific domains may need more comprehensive interaction analysis.
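To give a sense of what an interaction study would cost, the sketch below contrasts a one-at-a-time sweep with a full factorial grid. The grid values are drawn from the settings mentioned in this summary (temperatures 0.6-1.2, mini-batch sizes 1 and 64, rollout counts 8 and 16, KL coefficient 0.005); the key names and the assumption that the default KL coefficient is 0 are illustrative and not taken from the paper's configuration.

```python
from itertools import product

# Hypothetical search space; key names are illustrative, not framework options.
GRID = {
    "temperature": [0.6, 0.8, 1.0, 1.2],
    "mini_batch_size": [1, 64],
    "rollouts": [8, 16],
    "kl_beta": [0.0, 0.005],
}
DEFAULTS = {"temperature": 1.0, "mini_batch_size": 64, "rollouts": 8, "kl_beta": 0.0}


def one_at_a_time(grid, defaults):
    """Baseline run plus one run per non-default value, varying a single knob."""
    configs = [dict(defaults)]
    for name, values in grid.items():
        for value in values:
            if value != defaults[name]:
                configs.append({**defaults, name: value})
    return configs


def full_factorial(grid):
    """Every combination of values, which exposes pairwise interactions."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


oat_runs = one_at_a_time(GRID, DEFAULTS)
factorial_runs = full_factorial(GRID)
```

For this small space the one-at-a-time design needs 7 runs while the full factorial needs 32, which suggests why interaction effects are often left unexplored.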

External Reward Methods: Preliminary Evidence Only

The paper positions external reward methods as "the more promising direction for long-run URLVR scaling," but the empirical evidence is limited:

  • Self-verification: Tested only on Countdown task with 4k training/1k validation samples
  • Limited comparison to intrinsic methods on the same task: Figure 13 compares self-verification to trajectory-level entropy on Countdown, but majority voting (the primary intrinsic method) is not tested
  • No scaling analysis: The paper's central question is "how far can URLVR scale," but external methods are not subjected to the same scaling analysis (dataset size sweeps, step counts)

The paper appropriately frames these as "preliminary evidence," but the asymmetry between the depth of intrinsic analysis and external exploration leaves the claim of "more promising direction" partially unvalidated.

Theoretical Framework: \(\sigma = +1\) Methods Require Separate Analysis

Proposition 1 establishes Reward-Confidence Monotonicity only for methods with \(\sigma = -1\); Self-Certainty, which uses \(\sigma = +1\), falls outside its scope. The paper notes:

"While methods with \(\sigma = +1\) do not strictly align reward with raw confidence, they still induce sharpening by penalizing high-entropy distributions... Its sharpening mechanism requires separate analysis."

This separate analysis is not provided in the main text. While empirical results (Self-Certainty shows the most stable behavior in Figure 3) support the claim that it still sharpens, the theoretical mechanism is incomplete.

Computational Cost of Difficulty Estimation Not Included

While Model Collapse Step is shown to be 5.6× faster than running full RL (Table 3), the paper does not account for:

  • Initial hyperparameter tuning: Selecting aggressive settings (MBS=1, N=8) for rapid assessment
  • Multiple runs: Computing collapse step for multiple model candidates
  • Storage and tracking: Monitoring Reward Accuracy at each step

For practitioners with very limited compute, even 1.19B tokens may be prohibitive for rapid model selection.


7. Implications and Future Directions

How This Work Changes the Landscape

From isolated methods to unified understanding. Prior to this work, URLVR methods (TTRL, EM-RL, RLIF, RLSF, etc.) were studied as independent approaches with different mathematical formulations and empirical behaviors. This paper transforms the landscape by:

  1. Establishing a unified theoretical framework showing all intrinsic methods manipulate cross-entropy between anchor and model distributions
  2. Proving convergence to sharpening, which demonstrates that diverse methods share a common (and fundamentally limited) mechanism
  3. Identifying confidence-correctness alignment as the determining factor for success

This shifts the field's research agenda from "designing better intrinsic reward functions" to "understanding when existing methods work" and "exploring fundamentally different reward sources."

Resolution of conflicting prior findings. The paper resolves why some studies found intrinsic URLVR works while others found it fails. The reconciliation mechanism—confidence-correctness alignment—provides a clear boundary condition. Studies like Zuo et al. (2025) showing TTRL improvements were operating in regimes where alignment existed; studies showing failure were in regimes where it broke down.

Practical guidance for deployment. The paper's findings have immediate practical implications:

  • Safe application domain: Test-time training on small datasets (≤128 samples) is identified as a safe and effective use case
  • Unsafe application domain: Large-scale training (≥512 samples) is identified as fundamentally limited
  • Model selection: Model Collapse Step provides a practical tool for practitioners evaluating candidates for RL training

This moves URLVR from "promising but risky" to "useful in specific contexts with clear boundaries."

Follow-Up Research This Work Enables

1. Systematic study of external reward methods. The paper positions external rewards as "the more promising direction" but provides only preliminary evidence. Follow-up work should:

  • Apply the same systematic scaling analysis (dataset size sweeps, hyperparameter tuning, failure mode characterization) to external methods
  • Test self-verification on domains beyond Countdown (code generation with execution verification, theorem proving with proof assistants)
  • Compare intrinsic and external methods on identical tasks and compute budgets

2. Mechanistic understanding of model prior effects. The paper shows that model family, training stage, and scale affect URLVR stability, but the mechanisms remain unclear:

"Qwen family (left), the SFT variant maintains Reward Accuracy above 0.8 throughout training while the base model drops to near zero by step 200... In Llama family (right), both variants eventually collapse but at different rates."

Why does SFT enable stability in Qwen but not Llama? Why do smaller models outperform larger ones? Follow-up work could investigate:

  • Calibration analysis: Do stable models have better confidence-correctness alignment?
  • Representation analysis: What internal representations predict URLVR stability?
  • Training dynamics: How do different pretraining objectives affect URLVR behavior?

3. Improving Model Collapse Step as a metric. While promising, Model Collapse Step has limitations that suggest research directions:

  • Threshold optimization: Systematic study of what threshold (1%, 5%, dynamic) optimally predicts RL gains
  • Early prediction: Can collapse be predicted before it occurs using early training dynamics?
  • Domain generalization: Does Model Collapse Step computed on one domain predict performance on others?

4. Combining intrinsic and external rewards. The paper studies intrinsic and external methods independently, but combinations are unexplored:

  • Could intrinsic rewards be used for early training stabilization while external rewards provide long-term signal?
  • Can majority voting filter outputs before external verification (reducing verification cost)?
  • What is the optimal transition point from intrinsic to external?

5. Theoretical extensions. The theoretical framework opens several research questions:

  • Non-stationary rewards: What happens when the reward function itself evolves (e.g., self-verification improving over time)?
  • Multi-modal distributions: The analysis assumes convergence to a single majority; what if the model has multiple valid solution modes?
  • Theoretical bounds for external methods: Can similar convergence analysis characterize external reward methods?

Practical Applications and Downstream Use Cases

Test-time adaptation for domain-specific deployment. The clearest practical application is test-time training on small datasets:

"Training with 32 or 128 samples maintains stable performance without collapse, while larger datasets (≥512) consistently exhibit reward hacking."

Practitioners deploying models to specialized domains (medical reasoning, legal analysis, scientific computation) can use intrinsic URLVR to adapt models to domain-specific evaluation sets without ground-truth labels. The small-data constraint naturally fits scenarios where evaluation sets are limited.

Model selection for RL training pipelines. Organizations planning large-scale RL training can use Model Collapse Step to screen candidates:

  1. Compute collapse step for each candidate model using aggressive hyperparameters (MBS=1, N=8)
  2. Select models with largest collapse steps for full RL training
  3. Reduce wasted compute on models unlikely to benefit from RL

The 5.6× efficiency gain makes this practical even for large model families.
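The screening steps above can be sketched as a simple ranking. The collapse steps below are the seven values reported for the aggressive setting (MBS=1, N=8); the model labels are placeholders because this summary does not map the values to specific models, and treating a run that never collapses as the best candidate is an assumption.

```python
# Collapse steps reported for 7 models; labels are placeholders, not real names.
collapse_steps = {
    "model_a": 22, "model_b": 14, "model_c": 19, "model_d": 112,
    "model_e": 128, "model_f": 172, "model_g": 195,
}


def rank_by_collapse_step(steps, top_k):
    """Return the `top_k` model labels with the latest collapse.

    A value of None (no collapse observed) is ranked ahead of any finite step.
    """
    def key(item):
        _, step = item
        return float("inf") if step is None else step

    ranked = sorted(steps.items(), key=key, reverse=True)
    return [name for name, _ in ranked[:top_k]]


shortlist = rank_by_collapse_step(collapse_steps, top_k=2)

# Token budgets quoted from Table 3, in billions of tokens.
speedup = 6.66 / 1.19
```

The ratio of the two token budgets comes out just under 5.6, matching the quoted efficiency gain.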

Diagnostic for existing RL pipelines. Teams already running RLVR can use the paper's framework to diagnose failures:

  • If performance rises then falls, check Reward Accuracy to distinguish intrinsic URLVR failure from other issues
  • If collapse occurs early, consider whether the training set is too large (≥512 samples suggests using smaller subsets)
  • Monitor Actor Entropy to detect premature convergence

Iterative self-improvement loops. The TTRL results in Table 9 show that trained models can exceed the base model's majority-vote upper bound. This suggests applications in:

  • Automated training data generation: Use TTRL to improve models that then generate higher-quality synthetic data
  • Continuous improvement: Periodically apply test-time training on evaluation sets to maintain performance

Reproducibility and Integration Guidance

For researchers reproducing these findings:

  1. Framework: Use veRL with GRPO algorithm, implementing reward functions per Tables 1-2
  2. Default hyperparameters (Table 7): Temperature 1.0, batch size 64, mini-batch size 64, rollouts 8, learning rate \(10^{-6}\)
  3. Key metrics to track: Label Accuracy, Reward Accuracy, Ground Truth Reward, Actor Entropy, Mean Response Length
  4. Model family matters: Qwen family recommended for stability; Llama family will collapse regardless of training stage
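The default hyperparameters above could be captured in a small config object. This is a framework-agnostic sketch: the class and field names are hypothetical and are not veRL configuration keys, so they would need to be mapped onto the framework's actual options. The on-policy check assumes the usual convention that one gradient update per rollout batch (mini-batch size equal to batch size) is "pure on-policy".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URLVRDefaults:
    """Default values quoted from Table 7; field names are illustrative."""
    temperature: float = 1.0
    batch_size: int = 64
    mini_batch_size: int = 64
    rollouts_per_prompt: int = 8
    learning_rate: float = 1e-6

    @property
    def updates_per_batch(self) -> int:
        # Number of gradient updates taken on each batch of rollouts.
        return self.batch_size // self.mini_batch_size

    @property
    def is_pure_on_policy(self) -> bool:
        # One update per batch means every update uses fresh samples.
        return self.updates_per_batch == 1


default_cfg = URLVRDefaults()
aggressive_cfg = URLVRDefaults(mini_batch_size=1)  # setting used for fast collapse
```

With the defaults there is a single update per batch, while the aggressive mini-batch size of 1 yields 64 increasingly off-policy updates on the same rollouts.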

For practitioners integrating intrinsic URLVR:

When to use intrinsic methods:

  • Small datasets (≤128 samples): Test-time training scenarios
  • Domain-specific adaptation: When you have a small evaluation set without labels
  • Model selection: Using Model Collapse Step as a screening metric

When to avoid intrinsic methods:

  • Large-scale training (≥512 samples): Collapse is inevitable
  • Problems with poor confidence-correctness alignment: When the model's confident predictions are often wrong
  • High-stakes applications: When errors from collapse have severe consequences

Hyperparameter recommendations for stability:

  • Temperature: 1.0 for majority voting; 1.0-1.2 for entropy-based methods
  • Mini-batch size: 64 (pure on-policy) for maximum stability
  • Rollout count: 8 for balance between reliability and stability
  • KL regularization: Not recommended (marginal benefit, added variance, and it does not prevent collapse)

For practitioners exploring external reward methods:

  • Self-verification: Requires instruction-tuned models; test prompt sensitivity
  • Countdown-style tasks: Exploit generation-verification asymmetry where verification is cheap
  • Instruction-following: Key capability for self-verification; base models may need careful prompt engineering
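The generation-verification asymmetry in Countdown can be illustrated with a programmatic checker: finding an expression that hits the target requires search, while checking a proposed one is a single evaluation. The sketch below is only an illustration of that asymmetry. In self-verification the model itself performs the check, and the rule assumed here (each provided number may be used at most once, with only +, -, *, / allowed) is one common Countdown variant, not necessarily the paper's exact task definition.

```python
import ast
from collections import Counter
from fractions import Fraction

_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer literal used."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        if isinstance(node.op, ast.Div) and right == 0:
            raise ValueError("division by zero")
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def verify_countdown(expression, numbers, target):
    """Return True if `expression` reaches `target` using only the given numbers."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError):
        return False
    # Assumed rule: no number may be used more often than it was provided.
    if Counter(used) - Counter(numbers):
        return False
    return value == target
```

For example, `verify_countdown("(25 - 5) * 3", [25, 5, 3, 7], 60)` accepts the expression, while one that reuses a number or misses the target is rejected. Parsing with `ast` rather than calling `eval` keeps the checker from executing arbitrary model output.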

Integration with existing RLVR pipelines:

  1. Hybrid approach: Use intrinsic rewards for warm-start (first 50-100 steps), then switch to external rewards if available
  2. Filtering: Use majority voting to filter outputs before expensive external verification
  3. Monitoring: Track Reward Accuracy throughout training; switch strategies if it drops below 0.5
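The hybrid and monitoring steps above can be sketched as a small decision rule. This is a hedged illustration rather than a procedure from the paper: the function name, the returned labels, and the choice to stop when no external reward exists are assumptions, and tracking Reward Accuracy presumes access to a small labeled monitoring set.

```python
def pick_reward_source(step, reward_accuracy, warm_start_steps=100,
                       min_reward_accuracy=0.5, external_available=True):
    """Choose the reward source for the next training step.

    Returns "intrinsic", "external", or "stop". Thresholds follow the
    guidance above (warm-start of up to 100 steps, switch below 0.5).
    """
    degraded = reward_accuracy < min_reward_accuracy
    if not external_available:
        # Without an external signal, continuing past degradation risks collapse.
        return "stop" if degraded else "intrinsic"
    if degraded or step > warm_start_steps:
        return "external"
    return "intrinsic"


# Early in training with a healthy proxy signal: keep the intrinsic reward.
early_choice = pick_reward_source(step=10, reward_accuracy=0.8)
# Proxy signal has degraded before the warm-start ends: switch early.
degraded_choice = pick_reward_source(step=10, reward_accuracy=0.3)
```

Keeping the rule as a pure function of logged metrics makes it easy to replay against past training runs before wiring it into a live pipeline.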