THE INVISIBLE LEASH? WHY RLVR MAY OR MAY NOT ESCAPE ITS ORIGIN

🎯 Pitch

This paper introduces an "empirical support" framework and large-scale evaluations to show that common RLVR training largely sharpens probability mass onto solutions already present in the base model rather than discovering genuinely new reasoning modes. This matters because RLVR’s apparent gains in single-shot accuracy can mask a narrowing of the set of reachable correct solutions, which undermines long-run exploration, multi-sample robustness, and the claim that RLVR expands a model’s true reasoning boundary.

1. Executive Summary

This paper investigates whether today’s common recipe for RLVR (Reinforcement Learning with Verifiable Rewards) actually expands a base language model’s reasoning coverage, or mostly sharpens the probability mass onto solutions the base model could already produce. Using a new “empirical support” lens and large sampling budgets across many reasoning benchmarks, it finds that RLVR reliably improves single-try accuracy (e.g., pass@1 / avg@32) but more often shrinks the set of correct solutions that remain reachable under finite sampling than it expands it (Table 1; Figures 2–4). It also shows an entropy pattern where token-by-token uncertainty can increase while the distribution over final answers becomes less diverse, suggesting “local stochasticity without global exploration” (Section 4; Table 3).

2. Context and Motivation

Problem / gap.
RLVR has become a prominent post-training method for “reasoning models,” using automatically checkable rewards (e.g., correct/incorrect) to optimize a model’s outputs (Introduction; Section 2.1).
A key open question is whether standard RLVR:
- expands the model’s reasoning boundary (discovering qualitatively new correct solution modes), or
- mainly improves precision by amplifying already-available high-reward outputs while potentially suppressing alternative correct solutions (Introduction).
Why it matters.
If RLVR mostly redistributes probability mass within what the base model already “effectively” knows, then RLVR improvements at pass@1 may not translate into broader capability gains—especially when users can sample multiple attempts or when tasks have multiple correct solution routes (Introduction; Section 3.2; Section 4.2).
Prior approaches and where they fall short (as framed here).
The paper highlights an empirical pattern reported in prior work: RLVR-tuned models often win at low sampling budgets (pass@1) but can lose at high budgets (pass@k for large k), suggesting reduced coverage (Introduction; Section 3.2 gives concrete examples like AIME2024).
The paper acknowledges that pass@k is an imperfect proxy for “reasoning boundary” but uses it as a practical lens consistent with prior evaluation practice (Introduction).
How this paper positions itself.
It proposes measurement tools—empirical support categories and derived metrics—to separate:
- “precision sharpening” (better top-1) from
- “coverage expansion” (new correct solutions becoming reachable) (Section 2.2; Figure 1).
It then tests those tools across multiple RLVR model families and domains (Section 3.1; Table 1).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is an evaluation framework that compares a base model distribution q(y|x) to an RLVR-trained model distribution πθ(y|x) by examining which correct completions each model can actually reach under finite sampling.
It solves the problem of distinguishing “RLVR improves accuracy by sharpening” from “RLVR expands what solutions are discoverable,” using large-k sampling, a probability cutoff ϵ, and support-dynamics metrics (Sections 2–3; Figure 1; Table 1).

3.2 Big-picture architecture (diagram in words)

Model pairs. Choose a base model q and its RLVR counterpart πθ (Section 3.1; Table 1 lists multiple families/scales).
Benchmarks + verifiers/judges. Run on math and non-math reasoning datasets; correctness is verifiable (sometimes via a judge for SimpleQA) (Section 3.1).
Sampling engine. For each prompt x, sample k completions y from each model with controlled decoding parameters (Appendix B.1).
Correctness filtering. Identify which sampled completions are correct (R(x,y)=1) (Section 2.1).
Empirical-support analysis. Using a detectability cutoff ϵ, categorize correct solutions into preservation / shrinkage / expansion / out-of-support (Section 2.2; Figure 1).
Aggregate metrics + diagnostics. Compute SRR/NDR/SDS/NSCR (Section 2.2; Table 1), analyze pass@k curves (Figures 2–4), perplexity vs reference traces (Table 2), and entropy at token vs answer level (Section 4; Table 3).

3.3 Roadmap for the deep dive

First, define the objects being measured: distributions, rewards, and “empirical support” vs theoretical support (Section 2.1).
Next, explain the support dynamics categories and the derived metrics used to summarize them (Section 2.2).
Then, walk through the experimental pipeline: models, datasets, sampling budgets, and decoding/evaluation details (Section 3.1; Appendix B).
After that, summarize the main empirical findings on preservation/expansion/shrinkage, including high-k paradoxes (Section 3.2; Figures 2–4; Tables 1, 4–10).
Finally, explain the distributional diagnostics (entropy/perplexity) and the theoretical support-bounded arguments (Section 4; Table 2–3; Appendix C).

3.4 Detailed, sentence-based technical breakdown

Framing (what kind of paper + core idea).
This is primarily an empirical measurement paper with supporting theoretical analysis that treats common RLVR as a support-constrained optimizer: under on-policy sampling, RLVR tends to reweight probability mass among solutions the base model already samples, rather than creating robust access to truly novel solution modes (Section 3.2; Appendix C.1–C.2).
Key formal setup: distributions, rewards, and RLVR objective.
For a prompt x ∈ X and completion y ∈ Y, the base model defines a distribution q(y|x) and the RLVR-trained model defines πθ(y|x) (Section 2.1).
A verifiable reward function R(x,y) ∈ {0,1} marks correctness, and the set of correct completions is C = {y : R(x,y)=1} (Section 2.1).
The paper writes a generic KL-regularized RLVR objective (used to describe PPO/GRPO-like variants) as (Section 2.1):
    max_θ E_{x∼D, y∼πθ(·|x)} [ R(x,y) − β^{-1} log( πθ(y|x) / q(y|x) ) ]
where β>0 sets the regularization strength: the penalty for moving away from the base model is weighted by β^{-1}, so a larger β means a weaker penalty.
Why “support” needs an empirical relaxation.
In theory, the support of correct completions under distribution p is supp(p) = { y ∈ C : p(y|x) > 0 } (Definition 2.1).
In practice, softmax models assign nonzero probability nearly everywhere, so “support” becomes trivial; what matters is whether a correct completion has enough probability to be observed under finite rollouts (“empirical support diffusion” discussion, Section 2.1).
The paper defines an ϵ-thresholded empirical support (Section 2.1):
    supp_ϵ(q) = { y ∈ C : q(y|x) > ϵ }
Intuitively, ϵ acts as a detectability cutoff: if q(y|x) is below ϵ, you likely will not see that completion even with large but finite sampling (Section 2.1; Appendix C.4 derives a bound).
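To make the detectability cutoff concrete, the sketch below solves 1 − (1 − p)^k = confidence for p, i.e. the smallest per-sample probability at which k independent samples contain the completion at least once with the stated confidence. This is a standard calculation and an assumption about the general shape of the Appendix C.4 bound, not a reproduction of the paper's derivation; the function name `detectability_cutoff` is hypothetical.

```python
import math


def detectability_cutoff(k: int, confidence: float = 0.95) -> float:
    """Smallest per-sample probability p such that k i.i.d. samples contain
    the completion at least once with probability >= confidence.

    Solves 1 - (1 - p)**k = confidence for p (assumed form, see lead-in).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # p = 1 - (1 - confidence)**(1/k), written with log1p/expm1 so that
    # very small cutoffs keep their floating-point precision.
    return -math.expm1(math.log1p(-confidence) / k)


# Budgets taken from the sampling setup described in this summary.
for budget in (1024, 8192, 16384):
    print(budget, f"{detectability_cutoff(budget):.3e}")
```

For k=8192 and 95% confidence this gives a value close to 3.66×10^-4, the same order of magnitude the summary quotes from Appendix C.4.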
System/data pipeline diagram in words (explicit step-by-step).
Select a benchmark prompt x from a dataset distribution D (math or non-math tasks in Section 3.1; Appendix B.2).
Sample k completions from the base model by drawing y_1,…,y_k ~ q(·|x) using fixed decoding settings (temperature=0.6, top_p=0.95, max_response_length=32768) with vLLM inference (Appendix B.1).
Sample k completions from the RLVR model similarly: y'_1,…,y'_k ~ πθ(·|x) with the same decoding settings (Appendix B.1).
Determine correctness for each sampled completion using a verifiable procedure; for SimpleQA, the evaluation uses GPT-4.1 as judge (Section 3.1).
Collect the set of correct completions observed for each model at that prompt, and map them to empirical support membership based on whether their probability mass is above ϵ (conceptual definition in Section 2.1; categorization in Figure 1 / Section 2.2).
Categorize correct completions into four regions (Figure 1; Definitions 2.2–2.3):
- Preservation (P): correct solutions that are empirically supported by both base and RLVR models.
- Shrinkage (S): correct solutions empirically supported by the base but not by RLVR.
- Expansion (E): correct solutions not empirically supported by the base but supported by RLVR.
- Out of support (O): correct solutions supported by neither under the sampling budget / threshold.
Aggregate counts across prompts to compute summary metrics:
- SRR = P/(P+S) (Support Retention Rate),
- NDR = E/(P+E) (Net Discovery Rate),
- SDS as a harmonic-mean style balance between retention and discovery,
- NSCR = (E−S)/(P+E+S) (net change in empirical support) (Definition 2.3).
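The four summary metrics can be sketched directly from the aggregate counts. SRR, NDR, and NSCR follow the formulas listed above; SDS is implemented as a plain harmonic mean of SRR and NDR, which is an assumption based on the "harmonic-mean style" description rather than a quoted definition. The function name `support_metrics` is hypothetical.

```python
def support_metrics(p: int, e: int, s: int) -> dict:
    """Compute support-dynamics metrics from aggregate counts.

    p: preservation count, e: expansion count, s: shrinkage count.
    Empty denominators yield 0.0 instead of raising.
    """
    srr = p / (p + s) if (p + s) else 0.0              # Support Retention Rate
    ndr = e / (p + e) if (p + e) else 0.0              # Net Discovery Rate
    # Assumed form: harmonic mean of retention and discovery.
    sds = 2 * srr * ndr / (srr + ndr) if (srr + ndr) else 0.0
    total = p + e + s
    nscr = (e - s) / total if total else 0.0           # net support change
    return {"SRR": srr, "NDR": ndr, "SDS": sds, "NSCR": nscr}


# Counts reported for ProRL-1.5B-v2 (overall) in Table 1.
metrics = support_metrics(p=2388, e=48, s=175)
print({name: round(value, 2) for name, value in metrics.items()})
```

With these counts the rounded values match the Table 1 row quoted in the experimental analysis (SRR=0.93, NDR=0.02, SDS=0.04, NSCR=-0.05), which is consistent with the harmonic-mean reading of SDS.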
Run complementary diagnostics:
- pass@k as a function of k on selected tasks (Figures 2–4),
- perplexity on reference reasoning traces to probe compatibility with “external” solution styles/modes (Table 2; Section 3.2),
- token-level vs answer-level entropy at k=32 to separate local uncertainty from global diversity (Section 4.1–4.2; Table 3; Appendix B.4).
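For the pass@k diagnostic, this summary does not state which estimator the paper uses. A common choice in evaluation practice is the unbiased combinatorial estimator 1 − C(n−c, k)/C(n, k), where n samples were drawn and c of them were correct. The sketch below implements that estimator as an assumption; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Returns the probability that a random size-k subset of the n samples
    contains at least one correct completion.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only: 3 correct completions out of 8192 samples.
for k in (1, 64, 1024, 8192):
    print(k, round(pass_at_k(8192, 3, k), 4))
```

Evaluating the same (n, c) pair at increasing k is what produces a pass@k curve for a single prompt; averaging over prompts gives the benchmark-level curve.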
A concrete micro-example of the support categories (illustrative, consistent with the definitions).
Suppose for a prompt x there are two distinct correct completions y_A and y_B.
If the base model has q(y_A|x)=5e-4 and q(y_B|x)=1e-5, and the detectability cutoff is ϵ=3e-4 (Appendix C.4 gives a method to estimate its magnitude from k and a confidence level), then only y_A is in supp_ϵ(q).
If RLVR shifts to πθ(y_A|x)=1e-4 and πθ(y_B|x)=4e-4, then:
- y_A becomes shrinkage (base had it above ϵ, RLVR pushed it below),
- y_B becomes expansion (base had it below ϵ, RLVR pushed it above),
- and the net effect depends on whether such expansions are more or less frequent than shrinkages across tasks (captured by NSCR and reflected empirically in Table 1).
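The micro-example can be checked mechanically. The sketch below assumes per-completion probabilities are available for both models (in practice they would have to be estimated from samples) and applies the four-way taxonomy with a single cutoff; the function name `classify_support` and the category strings are hypothetical.

```python
def classify_support(q_prob: float, pi_prob: float, eps: float) -> str:
    """Assign a correct completion to one of the four support categories.

    q_prob:  probability under the base model q
    pi_prob: probability under the RLVR model
    eps:     detectability cutoff defining empirical support
    """
    in_base = q_prob > eps
    in_rlvr = pi_prob > eps
    if in_base and in_rlvr:
        return "preservation"
    if in_base:
        return "shrinkage"
    if in_rlvr:
        return "expansion"
    return "out_of_support"


eps = 3e-4
# Probabilities from the micro-example: (base, RLVR) for each completion.
completions = {"y_A": (5e-4, 1e-4), "y_B": (1e-5, 4e-4)}
for name, (q_prob, pi_prob) in completions.items():
    print(name, classify_support(q_prob, pi_prob, eps))
```

Counting these labels over all correct completions and prompts yields the P, S, E, and O totals that feed the summary metrics.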
Experimental setup details the paper specifies (and what it does not).
Models compared. The main RLVR model analyzed is ProRL-1.5B, built from DeepSeek-R1-Distill-Qwen-1.5B; additional RLVR families include Nemotron (7B/14B), Skywork-OR1-7B, Phi4-Reason-Plus-14B, and the vision-language model (VLM) Kangheng-OVR-7B (Section 3.1; Table 1; Tables 4–10).
Datasets.
- Math: MATH500, Minerva, OlympiadBench, AIME 2024, AIME 2025, AMC 2023 (Section 3.1; Appendix B.2).
- Non-math: SimpleQA, LiveBench subsets (R/C/L), SciBench, and Reasoning Gym (Section 3.1; Appendix B.2).
- VLM math: MathVista and MathVision testmini sets (Appendix B.2; Table 10).
Sampling budgets (critical to the “empirical support” concept).
- Math tasks: k ∈ {4096, 8192} (Section 3.1).
- Reasoning Gym: k ∈ {1024, 2048, 4096, 8192, 16384} (Section 3.1).
- Other non-math: typically k ∈ {1024, 2048} (Section 3.1).
Decoding/inference configuration.
- temperature=0.6, top_p=0.95, max_response_length=32768, inference backend vLLM (Appendix B.1).
Missing training hyperparameters.
- The paper does not provide optimizer type/settings, learning-rate schedule, batch size, number of training steps, training tokens, or compute budget for the RLVR runs in the provided content. It describes algorithmic ingredients at a high level (e.g., GRPO with decoupled clipping, dynamic sampling, KL regularization, periodic reference resets) for ProRL’s framework (Section 3.1), but not the concrete hyperparameter values.
Reasoning Gym evaluation corrections (important for interpretation).
The paper notes that the base model can be unfairly penalized by formatting mismatches on Reasoning Gym because ProRL was trained on that format, so it implements enhanced answer extraction and some task-specific normalizations to reduce format-driven evaluation artifacts (Appendix B.3).
This matters because “support shrinkage/expansion” depends on what is counted as a correct completion, so extraction/normalization can change P/E/S/O counts (Appendix B.3).
Entropy measurement methodology (how they compute it).
Token-level entropy is computed by teacher-forcing: after generating 32 samples, each sampled sequence is fed back through the model in one forward pass to read off the token distribution at each step, and the per-step entropies are averaged across steps and samples (Section 4.1; Appendix B.4).
Answer-level entropy is computed by extracting final answers from the 32 completions, forming the empirical distribution over unique answers, and computing Shannon entropy of that discrete distribution (Section 4.1; Appendix B.4).
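A minimal sketch of the two entropy measurements follows. It assumes natural-log Shannon entropy (the summary does not state the log base) and takes per-step next-token probability vectors as plain lists; in a real pipeline those vectors would come from a softmax over teacher-forced logits. Both function names are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers: list) -> float:
    """Shannon entropy (nats) of the empirical distribution over final answers."""
    if not answers:
        raise ValueError("answers must be non-empty")
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def token_entropy(step_distributions: list) -> float:
    """Mean Shannon entropy (nats) over per-step next-token distributions."""
    if not step_distributions:
        raise ValueError("step_distributions must be non-empty")
    per_step = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in step_distributions
    ]
    return sum(per_step) / len(per_step)


# Toy case: 32 samples that all end in the same answer, while individual
# decoding steps remain uncertain. Numbers are illustrative only.
print(answer_entropy(["42"] * 32))           # no diversity in final answers
print(token_entropy([[0.5, 0.5]] * 10))      # but uncertain token choices
```

The toy case mirrors the decoupling the paper describes: token-level entropy can be clearly positive while answer-level entropy is zero.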
Theoretical support-bounded arguments (what mechanism is claimed).
The paper formalizes a support-preservation claim for on-policy policy-gradient RLVR (Theorem C.1, Appendix C.1):
    supp(πθ(·|x)) ⊆ supp(q(·|x))
meaning if the base model assigns exactly zero probability to a correct completion, RLVR cannot discover it because it never gets sampled and thus never contributes gradient.
Because exact zeros are rare with softmax models, it extends the argument to ϵ-empirical support under finite sampling, arguing that with finite steps and budgets, RLVR remains bounded by what is detectable under q (Theorem C.3, Appendix C.1).
It also gives a variational view in which RLVR behaves like an exponential “tilt” of the base distribution by reward (Proposition C.4, Appendix C.2):
    π*(y|x) ∝ q(y|x) · exp(β R(x,y))
and in a KL-free limit, it approaches q renormalized over correct completions (Corollary C.5, Appendix C.2).
This directly suggests “distribution sharpening” rather than mode creation.
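The tilt can be illustrated on a toy discrete distribution. The sketch below applies π*(y|x) ∝ q(y|x) · exp(β R(x,y)) to four hypothetical completions, one of which is correct but has exactly zero base probability. All names and numbers are made up for illustration.

```python
import math


def tilt(q: dict, reward: dict, beta: float) -> dict:
    """Exponentially tilt a base distribution q by a binary reward."""
    weights = {y: q[y] * math.exp(beta * reward[y]) for y in q}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


# Hypothetical base distribution over four completions.
q = {"wrong": 0.5, "correct_common": 0.4, "correct_rare": 0.1, "unseen": 0.0}
reward = {"wrong": 0, "correct_common": 1, "correct_rare": 1, "unseen": 1}

for beta in (1.0, 5.0, 20.0):
    pi = tilt(q, reward, beta)
    print(beta, {y: round(prob, 4) for y, prob in pi.items()})
```

Two properties are visible in the output. The ratio between the two reachable correct completions stays at 4:1 for every β, so the tilt reweights rather than reorders them, and as β grows the result approaches q renormalized over correct completions (0.8 and 0.2). The correct completion with zero base probability stays at zero regardless of β.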

4. Key Insights and Innovations

(1) “Empirical support” as a measurable proxy for reachable reasoning solutions.
Novelty: Instead of treating model support as a theoretical set (trivial under softmax), it introduces supp_ϵ to represent solutions that are practically reachable under finite sampling (Section 2.1; Figure 1).
Significance: This reframes RLVR evaluation from “does accuracy improve?” to “does the set of reachable correct solutions expand or shrink?” (Sections 2.2–3.2).
(2) A four-way taxonomy + metrics for support dynamics.
Novelty: The paper defines Preservation, Shrinkage, Expansion, and Out of Support for correct completions, then summarizes behavior via SRR, NDR, SDS, and NSCR (Definitions 2.2–2.3).
Significance: These metrics separate “RLVR keeps what it had” (high SRR) from “RLVR finds genuinely new reachable solutions” (high NDR), exposing cases where RLVR improves pass@1 while still having net negative support change (Table 1).
(3) Cross-domain evidence that shrinkage typically outweighs expansion under current recipes.
Novelty: The central claim is not just observed in one domain; it is evaluated across math, logic, factual QA, coding, and even VLM math benchmarks (Section 3.1; Table 1; Tables 4–10).
Significance: This supports the paper’s “invisible leash” framing: RLVR is strongly tied to the base model’s effective distribution under these training/evaluation conditions (Section 3.2; Conclusion).
(4) Decoupling between token-level entropy and answer-level entropy.
Novelty: It explicitly measures both and shows they can move in opposite directions—token-level entropy can increase while answer-level entropy decreases (Section 4.2; Table 3).
Significance: This undermines a common intuition that “more stochastic decoding paths” necessarily imply more exploration of distinct solutions, introducing the phrase/idea “local stochasticity without global exploration” (Section 4.2).

5. Experimental Analysis

Evaluation methodology (datasets, metrics, setup).
Primary metrics.
- pass@k curves and high-k comparisons are used as a proxy for solution accessibility (Introduction; Section 3.2; Figures 2–4).
- Support dynamics metrics SRR/NDR/SDS/NSCR summarize how reachable correct completions change (Section 2.2; Table 1; Tables 4–10).
- avg@32 is used for medium-budget accuracy comparisons in entropy analysis (Section 4.1–4.2; Table 3).
- Perplexity is used as an auxiliary diagnostic against base traces, RLVR traces, and external references (Section 3.2; Table 2).
- Entropy metrics: TokenEntropy and AnswerEntropy (Section 4.1; Table 3).
Decoding controls. All inference uses temperature=0.6, top_p=0.95, and max_response_length=32768 with vLLM (Appendix B.1).
Sampling budgets. Up to k=8192 for math and k=16384 for Reasoning Gym (Section 3.1), which is crucial because the paper’s thesis is about large sampling regimes revealing hidden coverage differences.
Main quantitative results (with specific numbers).
Aggregate support dynamics show high retention but near-zero discovery.
- Across models/domains in Table 1:
  - SRR is very high (roughly 0.93–0.99 overall across listed models).
  - NDR is very low (typically ≤ 0.04, often around 0.00–0.02).
  - NSCR is slightly negative in most cases, indicating net shrinkage (Table 1).
- Example (ProRL-1.5B-v2 overall): SRR=0.93, NDR=0.02, SDS=0.04, NSCR=-0.05, with counts P=2388, E=48, S=175, O=793 (Table 1).
Concrete “high-k paradox” example (base beats RLVR at large k).
- For AIME2024 with ProRL-1.5B-v1 at pass@8192, base is 93.3% while RLVR is 83.3% (Table 4; also discussed in Section 3.2).
- For ProRL-1.5B-v1 on Minerva (pass@8192), base is 71.7% while RLVR is 63.6% (Table 4).
Pass@k curve shapes illustrate preservation vs expansion vs shrinkage.
- Preservation-typical tasks (Reasoning Gym examples) show RLVR accelerating convergence toward high pass@k but not necessarily changing what is ultimately reachable (Figure 2).
- Some tasks show expansion behavior (e.g., boxnet, dice, arc_1d examples) where RLVR improves across k in a way interpreted as bringing new modes above detectability (Figure 3).
- Shrinkage examples show base eventually overtaking RLVR at larger k (Figure 4).
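The crossover between low-k and high-k performance can be reproduced with a toy model in which each problem has a fixed per-sample success probability p, so that pass@k = 1 − (1 − p)^k. The probabilities below are hypothetical and chosen only to show the effect; they are not taken from the paper.

```python
def mean_pass_at_k(success_probs: list, k: int) -> float:
    """Average pass@k over problems, given per-sample success probabilities."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Two hypothetical problems. The "RLVR" policy is sharpened on problem 1
# but has pushed the correct solution to problem 2 far below detectability.
base = [0.05, 1e-3]
rlvr = [0.90, 1e-6]

for k in (1, 32, 1024, 8192):
    print(k, round(mean_pass_at_k(base, k), 4), round(mean_pass_at_k(rlvr, k), 4))
```

At k=1 the sharpened policy is far ahead, while at k=8192 the base policy solves both problems almost surely and the sharpened policy stays near one half, which is the shrinkage-style curve shape described above.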
Perplexity against external reasoning traces suggests limited compatibility with “outside-support” styles/modes.
- On unsolved-by-both problems, perplexity vs an external reference labeled “Claude Sonnet 4” increases substantially from base to ProRL, e.g.:
  - AIME2024: base 8.76 → ProRL 14.91 (Table 2).
  - AIME2025: base 6.05 → ProRL 9.76 (Table 2).
- The paper cautions that style/format differences can also contribute to perplexity gaps, not only “support” (Section 3.2).
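For reference, perplexity on a trace is the exponential of the negative mean per-token log-probability. The sketch below computes it from a list of natural-log token probabilities, which is an assumption about the input format; obtaining those log-probabilities from a model is outside its scope, and the function name is hypothetical.

```python
import math


def perplexity(token_logprobs: list) -> float:
    """Perplexity of a sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative values: a model that finds a reference trace less likely
# (lower per-token probability) reports a higher perplexity on it.
likely_trace = [math.log(0.5)] * 8
unlikely_trace = [math.log(0.1)] * 8
print(perplexity(likely_trace), perplexity(unlikely_trace))
```

Because perplexity depends on every token, surface differences such as formatting or phrasing raise it just as reasoning-mode differences do, which is the confound the paper flags.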
Entropy + accuracy trade-off at k=32 on math benchmarks.
- Accuracy (avg@32) increases under RLVR variants:
  - DeepSeek-1.5B avg 54.54% → ProRL-1.5B avg 65.43% (Table 3).
  - Qwen2.5-32B avg 43.03% → DAPO-32B avg 61.29% (Table 3).
- Answer-level entropy generally decreases under RLVR:
  - DeepSeek-1.5B answer entropy avg 1.30 → ProRL-1.5B 0.66 (Table 3).
  - Qwen2.5-32B 1.61 → DAPO-32B 0.61 (Table 3).
- Token-level entropy can increase or decrease depending on model:
  - ProRL-1.5B token entropy avg 0.52 vs base 0.44 (increase) (Table 3).
  - Some RLVR models show large decreases (e.g., AceReason-14B token entropy avg 0.14 vs base 0.33) (Table 3).
- This directly supports the paper’s claim that local uncertainty and global diversity can diverge (Section 4.2; Table 3).
Do the experiments support the claims?
The support-dynamics metrics and counts strongly support the specific claim: “high retention, low discovery, slight net shrinkage” across many model families and domains (Table 1; Tables 4–10).
The “high-k paradox” is concretely demonstrated on specific benchmarks (e.g., AIME2024, Minerva for ProRL-1.5B-v1 in Table 4) and visually supported by shrinkage-style pass@k curves (Figure 4).
The entropy claim is supported by paired measurements of token vs answer entropy across multiple models (Section 4; Table 3).
One area that remains somewhat indirect is equating “external reference perplexity” with “outside-support reasoning modes,” since the paper itself notes confounds from style/format differences (Section 3.2; Table 2).
Ablations / robustness checks / failure cases (as present in provided content).
The paper includes:
- multiple model families and scales (1.5B–14B, plus a 32B pair in entropy analysis) (Table 1; Table 3),
- multiple domains and large-k budgets (Section 3.1),
- explicit examples of expansion vs shrinkage tasks (Figures 3–4),
- and explicit evaluation-protocol adjustments for Reasoning Gym to reduce formatting artifacts (Appendix B.3).
The provided content does not include classic ablation studies on RLVR hyperparameters (e.g., varying β, sampling temperature during training, etc.); it instead triangulates using diverse model families and the theoretical appendices.

6. Limitations and Trade-offs

Dependence on pass@k and finite sampling as a proxy for “reasoning boundary.”
The paper itself notes pass@k is not a complete measure of reasoning capacity; it mainly captures retrieval/accessibility under sampling (Introduction). The entire “empirical support” framing inherits this limitation.
Choice and operationalization of ϵ.
The framework relies on a cutoff ϵ to define empirical reachability (Section 2.1). Although Appendix C.4 provides a confidence-bound-based estimate (e.g., for k=8192 and 95% confidence, an upper bound of roughly 3.66×10^-4), mapping that into a single global ϵ across tasks and models can still be delicate.
Evaluation pipeline sensitivity, especially for Reasoning Gym.
The paper modifies answer extraction and normalization to address formatting mismatch between base and ProRL (Appendix B.3). While this is motivated as fairness, it introduces additional processing that could affect correctness determination and thus P/E/S/O.
Training details are not fully specified in the provided content.
For a reader trying to reproduce or attribute causal mechanisms, key training configurations are missing here: optimizer, learning rate schedule, batch sizes, training steps/tokens, and compute/hardware budgets (Section 3.1 provides algorithmic components but not these numbers).
Perplexity-as-support evidence is suggestive but confounded.
The perplexity analysis interprets higher perplexity on external traces as reduced compatibility with broader solution modes, but the paper also acknowledges style/format differences across references (Section 3.2; Table 2). This weakens a strict “perplexity ⇒ support” conclusion.
Theory assumptions vs softmax realities.
The strict support theorem (Theorem C.1) depends on exact zero probabilities, which are uncommon in typical softmax LMs; the paper addresses this via the empirical support relaxation and finite-update bounds (Section 2.1; Theorem C.3 in Appendix C.1), but these are still abstractions over complex training dynamics.

7. Implications and Future Directions

How this work changes the landscape (within the scope of the paper).
It reframes evaluation of RLVR from “does it improve accuracy?” to “does it expand the reachable set of correct solutions under realistic sampling budgets?”, providing concrete metrics (SRR/NDR/SDS/NSCR) and showing that, for today’s common RLVR recipes across tested models, shrinkage typically dominates expansion (Section 2.2; Table 1; Section 3.2).
It cautions against interpreting token-level stochasticity as evidence of global exploration, since answer-level diversity can still collapse (Section 4.2; Table 3).
Follow-up research directions suggested by the results.
The paper argues that “breaking the invisible leash” likely requires explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented correct-solution regions (Abstract; Introduction; Conclusion).
The theoretical view of RLVR as an exponential tilt / KL projection (Appendix C.2) suggests a concrete reason: if updates largely reweight what already has mass under q, then expanding coverage needs some mechanism that changes the reachable region under finite sampling (Appendix C.1–C.2).
Practical applications / downstream use cases (practitioner takeaway).
If your deployment uses single-shot generation (or very small k), RLVR-style post-training can be beneficial because it concentrates probability on high-reward outputs, improving precision (Section 4.2; Table 3 shows consistent avg@32 gains, aligning with the general “precision enhancement” story).
If your deployment can afford large-k sampling (many attempts) and values coverage of diverse correct solutions, the paper’s evidence suggests standard RLVR may reduce that coverage, so a base model (or a less-sharpened policy) may perform better at high k on some tasks (Section 3.2; Table 4 AIME2024/Minerva examples; Figure 4).
Repro/Integration Guidance (when to prefer what, based on this paper’s evidence).
Prefer a standard RLVR model when:
- you need higher precision under constrained sampling budgets (small k), and
- tasks have narrow correct-answer sets where mode concentration is acceptable (Section 4.2; Table 3’s answer-entropy reductions).
Be cautious with standard RLVR when:
- you rely on multiple attempts / reranking / search (k large) to recover long-tail correct solutions, because net support change can be negative (Table 1 NSCR<0 in most aggregates; Section 3.2; Figure 4).
If you need both precision and broad coverage, the paper motivates exploring algorithmic additions that explicitly promote exploration or seed mass into underrepresented correct modes (Abstract; Conclusion), although it does not provide an implemented method in the provided content.