THE INVISIBLE LEASH? WHY RLVR MAY OR MAY NOT ESCAPE ITS ORIGIN
ArXiv: 2507.14843
🎯 Pitch
This paper introduces an "empirical support" framework and large-scale evaluations to show that common RLVR training largely sharpens probability mass onto solutions already present in the base model rather than discovering genuinely new reasoning modes. That matters because RLVR’s apparent gains in single-shot accuracy can mask a narrowing of reachable correct solutions—undermining long-run exploration, multi-sample robustness, and the claim that RLVR expands a model’s true reasoning boundary.
1. Executive Summary
This paper investigates whether today’s common recipe for RLVR (Reinforcement Learning with Verifiable Rewards) actually expands a base language model’s reasoning coverage, or mostly sharpens the probability mass onto solutions the base model could already produce. Using a new “empirical support” lens and large sampling budgets across many reasoning benchmarks, it finds that RLVR reliably improves single-try accuracy (e.g., pass@1 / avg@32) but more often shrinks the set of correct solutions that remain reachable under finite sampling than it expands it (Table 1; Figures 2–4). It also shows an entropy pattern where token-by-token uncertainty can increase while the distribution over final answers becomes less diverse, suggesting “local stochasticity without global exploration” (Section 4; Table 3).
2. Context and Motivation
- Problem / gap.
  - RLVR has become a prominent post-training method for “reasoning models,” using automatically checkable rewards (e.g., correct/incorrect) to optimize a model’s outputs (Introduction; Section 2.1).
  - A key open question is whether standard RLVR:
    - expands the model’s reasoning boundary (discovering qualitatively new correct solution modes), or
    - mainly improves precision by amplifying already-available high-reward outputs while potentially suppressing alternative correct solutions (Introduction).
- Why it matters.
  - If RLVR mostly redistributes probability mass within what the base model already “effectively” knows, then RLVR improvements at pass@1 may not translate into broader capability gains, especially when users can sample multiple attempts or when tasks have multiple correct solution routes (Introduction; Section 3.2; Section 4.2).
- Prior approaches and where they fall short (as framed here).
  - The paper highlights an empirical pattern seen in prior observations: RLVR-tuned models often win at low sampling budgets (pass@1) but can lose at high budgets (pass@k for large k), suggesting reduced coverage (Introduction; Section 3.2 gives concrete examples like AIME2024).
  - pass@k is acknowledged as an imperfect proxy for “reasoning boundary,” but it is used as a practical lens consistent with prior evaluation practice (Introduction).
- How this paper positions itself.
  - It proposes measurement tools (empirical support categories and derived metrics) to separate:
    - “precision sharpening” (better top-1) from
    - “coverage expansion” (new correct solutions becoming reachable) (Section 2.2; Figure 1).
  - It then tests those tools across multiple RLVR model families and domains (Section 3.1; Table 1).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is an evaluation framework that compares a base model distribution q(y|x) to an RLVR-trained model distribution πθ(y|x) by examining which correct completions each model can actually reach under finite sampling.
- It addresses the problem of distinguishing “RLVR improves accuracy by sharpening” from “RLVR expands what solutions are discoverable,” using large-k sampling, a probability cutoff ϵ, and support-dynamics metrics (Sections 2–3; Figure 1; Table 1).
3.2 Big-picture architecture (diagram in words)
- Model pairs. Choose a base model q and its RLVR counterpart πθ (Section 3.1; Table 1 lists multiple families/scales).
- Benchmarks + verifiers/judges. Run on math and non-math reasoning datasets; correctness is verifiable (for SimpleQA, via a judge model) (Section 3.1).
- Sampling engine. For each prompt x, sample k completions y from each model with controlled decoding parameters (Appendix B.1).
- Correctness filtering. Identify which sampled completions are correct (R(x,y)=1) (Section 2.1).
- Empirical-support analysis. Using a detectability cutoff ϵ, categorize correct solutions into preservation / shrinkage / expansion / out-of-support (Section 2.2; Figure 1).
- Aggregate metrics + diagnostics. Compute SRR/NDR/SDS/NSCR (Section 2.2; Table 1), analyze pass@k curves (Figures 2–4), perplexity vs reference traces (Table 2), and entropy at token vs answer level (Section 4; Table 3).
3.3 Roadmap for the deep dive
- First, define the objects being measured: distributions, rewards, and “empirical support” vs theoretical support (Section 2.1).
- Next, explain the support dynamics categories and the derived metrics used to summarize them (Section 2.2).
- Then, walk through the experimental pipeline: models, datasets, sampling budgets, and decoding/evaluation details (Section 3.1; Appendix B).
- After that, summarize the main empirical findings on preservation/expansion/shrinkage, including high-k paradoxes (Section 3.2; Figures 2–4; Tables 1, 4–10).
- Finally, explain the distributional diagnostics (entropy/perplexity) and the theoretical support-bounded arguments (Section 4; Tables 2–3; Appendix C).
3.4 Detailed, sentence-based technical breakdown
- Framing (what kind of paper + core idea).
  - This is primarily an empirical measurement paper with supporting theoretical analysis that treats common RLVR as a support-constrained optimizer: under on-policy sampling, RLVR tends to reweight probability mass among solutions the base model already samples, rather than creating robust access to truly novel solution modes (Section 3.2; Appendix C.1–C.2).
- Key formal setup: distributions, rewards, and RLVR objective.
  - For a prompt x ∈ X and completion y ∈ Y, the base model defines a distribution q(y|x) and the RLVR-trained model defines πθ(y|x) (Section 2.1).
  - A verifiable reward function R(x,y) ∈ {0,1} marks correctness, and the set of correct completions is C = {y : R(x,y)=1} (Section 2.1).
  - The paper writes a generic KL-regularized RLVR objective (used to describe PPO/GRPO-like variants) as:

    > maximize over θ: E_{x∼D, y∼πθ(·|x)} [ R(x,y) − β^{-1} log ( πθ(y|x) / q(y|x) ) ] (Section 2.1)

    where β>0 sets the strength of the penalty for moving away from the base model. Because the penalty term is weighted by β^{-1}, a smaller β means a stronger pull toward q.
- Why “support” needs an empirical relaxation.
  - In theory, the support of correct completions under distribution p is supp(p) = { y ∈ C : p(y|x) > 0 } (Definition 2.1).
  - In practice, softmax models assign nonzero probability nearly everywhere, so “support” becomes trivial; what matters is whether a correct completion has enough probability to be observed under finite rollouts (“empirical support diffusion” discussion, Section 2.1).
  - The paper defines an ϵ-thresholded empirical support:

    > supp_ϵ(q) = { y ∈ C : q(y|x) > ϵ } (Section 2.1)

  - Intuitively, ϵ acts as a detectability cutoff: if q(y|x) is below ϵ, you likely will not see that completion even with large but finite sampling (Section 2.1; Appendix C.4 derives a bound).
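To make the detectability cutoff concrete, here is a minimal sketch of one way to turn a sampling budget k and a confidence level into a value for ϵ. It assumes i.i.d. sampling and solves (1 − ϵ)^k = 1 − confidence; this is an illustrative assumption, not necessarily the exact derivation in Appendix C.4, although it reproduces the ≈3.66×10^-4 magnitude quoted for k=8192 at 95% confidence. The function names are hypothetical.

```python
import math


def detectability_cutoff(k: int, confidence: float = 0.95) -> float:
    """Return eps such that a completion with probability eps is missed by
    all k i.i.d. samples with probability exactly (1 - confidence).

    Assumption (not from the paper): solve (1 - eps)**k = 1 - confidence.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # eps = 1 - (1 - confidence)**(1/k), computed stably for tiny eps.
    return -math.expm1(math.log1p(-confidence) / k)


def miss_probability(p: float, k: int) -> float:
    """Probability that k i.i.d. samples never produce a completion whose
    per-sample probability is p."""
    return (1.0 - p) ** k


if __name__ == "__main__":
    for budget in (1024, 4096, 8192, 16384):
        print(budget, detectability_cutoff(budget))
```

Under this assumption, doubling the sampling budget roughly halves the cutoff, which is why the choice of k matters so much for what counts as "reachable."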
- System/data pipeline diagram in words (explicit step-by-step).
  - Select a benchmark prompt x from a dataset distribution D (math or non-math tasks in Section 3.1; Appendix B.2).
  - Sample k completions from the base model by drawing y_1,…,y_k ~ q(·|x) using fixed decoding settings (temperature 0.6, top_p=0.95, max length 32768) with vLLM inference (Appendix B.1).
  - Sample k completions from the RLVR model similarly: y'_1,…,y'_k ~ πθ(·|x) with the same decoding settings (Appendix B.1).
  - Determine correctness for each sampled completion using a verifiable procedure; for SimpleQA, the evaluation uses GPT-4.1 as judge (Section 3.1).
  - Collect the set of correct completions observed for each model at that prompt, and map them to empirical support membership based on whether their probability mass is above ϵ (conceptual definition in Section 2.1; categorization in Figure 1 / Section 2.2).
  - Categorize correct completions into four regions (Figure 1; Definitions 2.2–2.3):
    - Preservation (P): correct solutions that are empirically supported by both base and RLVR models.
    - Shrinkage (S): correct solutions empirically supported by the base but not by RLVR.
    - Expansion (E): correct solutions not empirically supported by the base but supported by RLVR.
    - Out of support (O): correct solutions supported by neither under the sampling budget / threshold.
  - Aggregate counts across prompts to compute summary metrics:
    - SRR = P/(P+S) (Support Retention Rate),
    - NDR = E/(P+E) (Net Discovery Rate),
    - SDS as a harmonic-mean style balance between retention and discovery,
    - NSCR = (E−S)/(P+E+S) (net change in empirical support) (Definition 2.3).
  - Run complementary diagnostics:
    - pass@k as a function of k on selected tasks (Figures 2–4),
    - perplexity on reference reasoning traces to probe compatibility with “external” solution styles/modes (Table 2; Section 3.2),
    - token-level vs answer-level entropy at k=32 to separate local uncertainty from global diversity (Section 4.1–4.2; Table 3; Appendix B.4).
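The aggregation step in the pipeline above can be sketched as follows. The SRR, NDR, and NSCR formulas are taken directly from the definitions listed; SDS is only described as "harmonic-mean style," so the sketch assumes a plain harmonic mean of SRR and NDR, which may differ from the paper's exact definition. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportCounts:
    """Counts of correct solutions per category, aggregated over prompts."""
    preserved: int           # P: reachable by both base and RLVR
    expanded: int            # E: reachable by RLVR only
    shrunk: int              # S: reachable by base only
    out_of_support: int = 0  # O: reachable by neither (unused by the metrics)


def _ratio(num: float, den: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def support_metrics(c: SupportCounts) -> dict:
    """Compute SRR, NDR, SDS, NSCR from category counts.

    Assumption: SDS is the harmonic mean of SRR and NDR.
    """
    p, e, s = c.preserved, c.expanded, c.shrunk
    srr = _ratio(p, p + s)
    ndr = _ratio(e, p + e)
    sds = _ratio(2.0 * srr * ndr, srr + ndr)
    nscr = _ratio(e - s, p + e + s)
    return {"SRR": srr, "NDR": ndr, "SDS": sds, "NSCR": nscr}


if __name__ == "__main__":
    # Counts reported for ProRL-1.5B-v2 (overall) in this summary.
    counts = SupportCounts(preserved=2388, expanded=48, shrunk=175,
                           out_of_support=793)
    for name, value in support_metrics(counts).items():
        print(f"{name}: {value:.2f}")
```

With the ProRL-1.5B-v2 counts quoted later in this summary (P=2388, E=48, S=175), the sketch rounds to the same SRR, NDR, SDS, and NSCR values reported there, which suggests the harmonic-mean assumption is at least consistent with that row.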
- A concrete micro-example of the support categories (illustrative, consistent with the definitions).
  - Suppose for a prompt x there are two distinct correct completions y_A and y_B.
  - If the base model has q(y_A|x)=5e-4 and q(y_B|x)=1e-5, and the detectability cutoff is ϵ=3e-4 (Appendix C.4 gives a method to estimate magnitude from k and confidence), then only y_A is in supp_ϵ(q).
  - If RLVR shifts to πθ(y_A|x)=1e-4 and πθ(y_B|x)=4e-4, then:
    - y_A becomes shrinkage (base had it above ϵ, RLVR pushed it below),
    - y_B becomes expansion (base had it below ϵ, RLVR pushed it above),
    - and the net effect depends on whether such expansions are more or less frequent than shrinkages across tasks (captured by NSCR and reflected empirically in Table 1).
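The micro-example can be checked mechanically. The sketch below applies the four-way categorization to known probabilities; this is a simplification, since in practice membership is inferred from whether a correct solution shows up in finite samples rather than from exact probabilities. The `categorize` helper is hypothetical.

```python
def categorize(q_prob: float, pi_prob: float, eps: float) -> str:
    """Assign a correct completion to a support-dynamics category.

    q_prob:  probability under the base model q(y|x)
    pi_prob: probability under the RLVR model pi_theta(y|x)
    eps:     detectability cutoff for empirical support
    """
    in_base = q_prob > eps
    in_rlvr = pi_prob > eps
    if in_base and in_rlvr:
        return "preservation"
    if in_base:
        return "shrinkage"
    if in_rlvr:
        return "expansion"
    return "out_of_support"


if __name__ == "__main__":
    eps = 3e-4
    # (q, pi_theta) pairs from the micro-example.
    completions = {"y_A": (5e-4, 1e-4), "y_B": (1e-5, 4e-4)}
    for name, (q_prob, pi_prob) in completions.items():
        print(name, categorize(q_prob, pi_prob, eps))
```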
- Experimental setup details the paper specifies (and what it does not).
  - Models compared. The main RLVR model analyzed is ProRL-1.5B, built from DeepSeek-R1-Distill-Qwen-1.5B; additional RLVR families include Nemotron (7B/14B), Skywork-OR1-7B, Phi4-Reason-Plus-14B, and a vision-language model, Kangheng-OVR-7B (VLM) (Section 3.1; Table 1; Tables 4–10).
  - Datasets.
    - Math: MATH500, Minerva, OlympiadBench, AIME 2024, AIME 2025, AMC 2023 (Section 3.1; Appendix B.2).
    - Non-math: SimpleQA, LiveBench subsets (R/C/L), SciBench, and Reasoning Gym (Section 3.1; Appendix B.2).
    - VLM math: MathVista and MathVision testmini sets (Appendix B.2; Table 10).
  - Sampling budgets (critical to the “empirical support” concept).
    - Math tasks: k ∈ {4096, 8192} (Section 3.1).
    - Reasoning Gym: k ∈ {1024, 2048, 4096, 8192, 16384} (Section 3.1).
    - Other non-math: typically k ∈ {1024, 2048} (Section 3.1).
  - Decoding/inference configuration.
    - temperature=0.6, top_p=0.95, max_response_length=32768, inference backend vLLM (Appendix B.1).
  - Missing training hyperparameters.
    - The paper does not provide optimizer type/settings, learning-rate schedule, batch size, number of training steps, training tokens, or compute budget for the RLVR runs in the provided content. It describes algorithmic ingredients at a high level (e.g., GRPO with decoupled clipping, dynamic sampling, KL regularization, periodic reference resets) for ProRL’s framework (Section 3.1), but not the concrete hyperparameter values.
  - Reasoning Gym evaluation corrections (important for interpretation).
    - The paper notes that the base model can be unfairly penalized by formatting mismatches on Reasoning Gym because ProRL was trained on that format, so it implements enhanced answer extraction and some task-specific normalizations to reduce format-driven evaluation artifacts (Appendix B.3).
    - This matters because “support shrinkage/expansion” depends on what is counted as a correct completion, so extraction/normalization can change P/E/S/O counts (Appendix B.3).
- Entropy measurement methodology (how they compute it).
  - Token-level entropy is computed by teacher-forcing: after generating 32 samples, each sequence is fed back in one forward pass to read off the token distribution at each step, and entropies are averaged across steps and samples (Section 4.1; Appendix B.4).
  - Answer-level entropy is computed by extracting final answers from the 32 completions, forming the empirical distribution over unique answers, and computing Shannon entropy of that discrete distribution (Section 4.1; Appendix B.4).
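The two entropy measurements can be sketched on toy inputs as below. The sketch assumes natural-log (nats) entropy and takes per-step token distributions as plain probability lists; the paper's log base and the way distributions are read out of the model during teacher forcing are not specified here, so treat both as assumptions. Function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def answer_entropy(answers: Sequence[Hashable]) -> float:
    """Shannon entropy (nats) of the empirical distribution over final
    answers extracted from a set of sampled completions."""
    if not answers:
        raise ValueError("need at least one answer")
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average per-step entropy over all steps (pooled across samples)."""
    if not step_distributions:
        raise ValueError("need at least one step distribution")
    entropies = [token_entropy(d) for d in step_distributions]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # 32 completions that all end in the same answer: no answer diversity.
    print(answer_entropy(["42"] * 32))
    # Even split between two answers.
    print(answer_entropy(["42"] * 16 + ["17"] * 16))
    # One uncertain step and one deterministic step.
    print(mean_token_entropy([[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]))
```

The toy inputs show how the two quantities can decouple: many uncertain intermediate steps raise the token-level average even when every completion lands on the same final answer.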
- Theoretical support-bounded arguments (what mechanism is claimed).
  - The paper formalizes a support-preservation claim for on-policy policy-gradient RLVR:

    > supp(πθ(·|x)) ⊆ supp(q(·|x)) (Theorem C.1, Appendix C.1)

    meaning if the base model assigns exactly zero probability to a correct completion, RLVR cannot discover it because it never gets sampled and thus never contributes gradient.
  - Because exact zeros are rare with softmax models, it extends the argument to ϵ-empirical support under finite sampling, arguing that with finite steps and budgets, RLVR remains bounded by what is detectable under q (Theorem C.3, Appendix C.1).
  - It also gives a variational view: RLVR behaves like an exponential “tilt” of the base distribution by reward:

    > π*(y|x) ∝ q(y|x) · exp(β R(x,y)) (Proposition C.4, Appendix C.2)

    and in a KL-free limit (β → ∞ under the β^{-1}-weighted penalty above), it approaches q renormalized over correct completions (Corollary C.5, Appendix C.2). This directly suggests “distribution sharpening” rather than mode creation.
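A toy numerical illustration of the exponential tilt may help. The sketch below applies π*(y|x) ∝ q(y|x) · exp(β R(x,y)) to a hand-made four-outcome distribution; the outcome labels and probabilities are invented for illustration and do not come from the paper.

```python
import math
from typing import Dict


def tilt(q: Dict[str, float], reward: Dict[str, int], beta: float) -> Dict[str, float]:
    """Return pi*(y) proportional to q(y) * exp(beta * R(y)), normalized."""
    weights = {y: p * math.exp(beta * reward[y]) for y, p in q.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical base distribution over four completions.
    q = {"correct_common": 0.30, "correct_rare": 0.01,
         "wrong": 0.69, "correct_unseen": 0.0}
    reward = {"correct_common": 1, "correct_rare": 1,
              "wrong": 0, "correct_unseen": 1}

    for beta in (1.0, 5.0, 50.0):
        pi = tilt(q, reward, beta)
        print(beta, {y: round(p, 4) for y, p in pi.items()})
```

Three properties of the tilt are visible in the output: a completion with zero base probability stays at zero for every β, the ratio between two correct completions never changes, and as β grows the result approaches q renormalized over the correct set. All three are properties of reweighting, which is the sense in which the tilt sharpens rather than creates modes.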
4. Key Insights and Innovations
- (1) “Empirical support” as a measurable proxy for reachable reasoning solutions.
  - Novelty: Instead of treating model support as a theoretical set (trivial under softmax), it introduces supp_ϵ to represent solutions that are practically reachable under finite sampling (Section 2.1; Figure 1).
  - Significance: This reframes RLVR evaluation from “does accuracy improve?” to “does the set of reachable correct solutions expand or shrink?” (Sections 2.2–3.2).
- (2) A four-way taxonomy + metrics for support dynamics.
  - Novelty: The paper defines Preservation, Shrinkage, Expansion, and Out of Support for correct completions, then summarizes behavior via SRR, NDR, SDS, and NSCR (Definitions 2.2–2.3).
  - Significance: These metrics separate “RLVR keeps what it had” (high SRR) from “RLVR finds genuinely new reachable solutions” (high NDR), exposing cases where RLVR improves pass@1 while still having net negative support change (Table 1).
- (3) Cross-domain evidence that shrinkage typically outweighs expansion under current recipes.
  - Novelty: The central claim is not just observed in one domain; it is evaluated across math, logic, factual QA, coding, and even VLM math benchmarks (Section 3.1; Table 1; Tables 4–10).
  - Significance: This supports the paper’s “invisible leash” framing: RLVR is strongly tied to the base model’s effective distribution under these training/evaluation conditions (Section 3.2; Conclusion).
- (4) Decoupling between token-level entropy and answer-level entropy.
  - Novelty: It explicitly measures both and shows they can move in opposite directions: token-level entropy can increase while answer-level entropy decreases (Section 4.2; Table 3).
  - Significance: This undermines a common intuition that “more stochastic decoding paths” necessarily imply more exploration of distinct solutions, introducing the phrase/idea “local stochasticity without global exploration” (Section 4.2).
5. Experimental Analysis
- Evaluation methodology (datasets, metrics, setup).
  - Primary metrics.
    - pass@k curves and high-k comparisons are used as a proxy for solution accessibility (Introduction; Section 3.2; Figures 2–4).
    - Support dynamics metrics SRR/NDR/SDS/NSCR summarize how reachable correct completions change (Section 2.2; Table 1; Tables 4–10).
    - avg@32 is used for medium-budget accuracy comparisons in entropy analysis (Section 4.1–4.2; Table 3).
    - Perplexity is used as an auxiliary diagnostic against base traces, RLVR traces, and external references (Section 3.2; Table 2).
    - Entropy metrics: TokenEntropy and AnswerEntropy (Section 4.1; Table 3).
  - Decoding controls. All inference uses temperature=0.6, top_p=0.95, and max_response_length=32768 with vLLM (Appendix B.1).
  - Sampling budgets. Up to k=8192 for math and k=16384 for Reasoning Gym (Section 3.1), which is crucial because the paper’s thesis is about large sampling regimes revealing hidden coverage differences.
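For readers who want to reproduce pass@k curves from a fixed pool of samples, the sketch below uses the widely used unbiased estimator 1 − C(n−c, k)/C(n, k), where n is the number of samples drawn and c the number that are correct. Whether the paper uses exactly this estimator is an assumption; the summary only states that pass@k is reported.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must include a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # A problem solved by only 1 of 8192 samples: nearly invisible at k=1,
    # but it contributes substantially at large k.
    for k in (1, 32, 1024, 4096, 8192):
        print(k, pass_at_k(8192, 1, k))
```

The example shows why large budgets are needed to see coverage differences: a solution found once in 8192 samples barely moves pass@1 but has an even chance of appearing in a random subset of 4096.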
- Main quantitative results (with specific numbers).
  - Aggregate support dynamics show high retention but near-zero discovery.
    - Across models/domains in Table 1:
      - SRR is very high (roughly 0.93–0.99 overall across listed models).
      - NDR is very low (typically ≤ 0.04, often around 0.00–0.02).
      - NSCR is slightly negative in most cases, indicating net shrinkage (Table 1).
    - Example (ProRL-1.5B-v2 overall): SRR=0.93, NDR=0.02, SDS=0.04, NSCR=-0.05, with counts P=2388, E=48, S=175, O=793 (Table 1).
  - Concrete “high-k paradox” example (base beats RLVR at large k).
    - For AIME2024 with ProRL-1.5B-v1 at pass@8192, base is 93.3% while RLVR is 83.3% (Table 4; also discussed in Section 3.2).
    - For ProRL-1.5B-v1 on Minerva (pass@8192), base is 71.7% while RLVR is 63.6% (Table 4).
  - Pass@k curve shapes illustrate preservation vs expansion vs shrinkage.
    - Preservation-typical tasks (Reasoning Gym examples) show RLVR accelerating convergence toward high pass@k but not necessarily changing what is ultimately reachable (Figure 2).
    - Some tasks show expansion behavior (e.g., boxnet, dice, arc_1d examples) where RLVR improves across k in a way interpreted as bringing new modes above detectability (Figure 3).
    - Shrinkage examples show base eventually overtaking RLVR at larger k (Figure 4).
  - Perplexity against external reasoning traces suggests limited compatibility with “outside-support” styles/modes.
    - On unsolved-by-both problems, perplexity vs an external reference labeled “Claude Sonnet 4” increases substantially from base to ProRL, e.g.:
      - AIME2024: base 8.76 → ProRL 14.91 (Table 2).
      - AIME2025: base 6.05 → ProRL 9.76 (Table 2).
    - The paper cautions that style/format differences can also contribute to perplexity gaps, not only “support” (Section 3.2).
  - Entropy + accuracy trade-off at k=32 on math benchmarks.
    - Accuracy (avg@32) increases under RLVR variants:
      - DeepSeek-1.5B avg 54.54% → ProRL-1.5B avg 65.43% (Table 3).
      - Qwen2.5-32B avg 43.03% → DAPO-32B avg 61.29% (Table 3).
    - Answer-level entropy generally decreases under RLVR:
      - DeepSeek-1.5B answer entropy avg 1.30 → ProRL-1.5B 0.66 (Table 3).
      - Qwen2.5-32B 1.61 → DAPO-32B 0.61 (Table 3).
    - Token-level entropy can increase or decrease depending on model:
      - ProRL-1.5B token entropy avg 0.52 vs base 0.44 (increase) (Table 3).
      - Some RLVR models show large decreases (e.g., AceReason-14B token entropy avg 0.14 vs base 0.33) (Table 3).
    - This directly supports the paper’s claim that local uncertainty and global diversity can diverge (Section 4.2; Table 3).
- Do the experiments support the claims?
  - The support-dynamics metrics and counts strongly support the specific claim: “high retention, low discovery, slight net shrinkage” across many model families and domains (Table 1; Tables 4–10).
  - The “high-k paradox” is concretely demonstrated on specific benchmarks (e.g., AIME2024, Minerva for ProRL-1.5B-v1 in Table 4) and visually supported by shrinkage-style pass@k curves (Figure 4).
  - The entropy claim is supported by paired measurements of token vs answer entropy across multiple models (Section 4; Table 3).
  - One area that remains somewhat indirect is equating “external reference perplexity” with “outside-support reasoning modes,” since the paper itself notes confounds from style/format differences (Section 3.2; Table 2).
- Ablations / robustness checks / failure cases (as present in provided content).
  - The paper includes:
    - multiple model families and scales (1.5B–14B, plus a 32B pair in entropy analysis) (Table 1; Table 3),
    - multiple domains and large-k budgets (Section 3.1),
    - explicit examples of expansion vs shrinkage tasks (Figures 3–4),
    - and explicit evaluation-protocol adjustments for Reasoning Gym to reduce formatting artifacts (Appendix B.3).
  - The provided content does not include classic ablation studies on RLVR hyperparameters (e.g., varying β, sampling temperature during training, etc.); it instead triangulates using diverse model families and the theoretical appendices.
6. Limitations and Trade-offs
- Dependence on pass@k and finite sampling as a proxy for “reasoning boundary.”
  - The paper itself notes pass@k is not a complete measure of reasoning capacity; it mainly captures retrieval/accessibility under sampling (Introduction). The entire “empirical support” framing inherits this limitation.
- Choice and operationalization of ϵ.
  - The framework relies on a cutoff ϵ to define empirical reachability (Section 2.1). Although Appendix C.4 provides a confidence-bound-based estimate (e.g., for k=8192 and 95% confidence, an upper bound of magnitude ≈3.66×10^-4), mapping that into a single global ϵ across tasks/models can still be delicate.
- Evaluation pipeline sensitivity, especially for Reasoning Gym.
  - The paper modifies answer extraction and normalization to address formatting mismatch between base and ProRL (Appendix B.3). While this is motivated as fairness, it introduces additional processing that could affect correctness determination and thus P/E/S/O.
- Training details are not fully specified in the provided content.
  - For a reader trying to reproduce or attribute causal mechanisms, key training configurations are missing here: optimizer, learning rate schedule, batch sizes, training steps/tokens, and compute/hardware budgets (Section 3.1 provides algorithmic components but not these numbers).
- Perplexity-as-support evidence is suggestive but confounded.
  - The perplexity analysis interprets higher perplexity on external traces as reduced compatibility with broader solution modes, but the paper also acknowledges style/format differences across references (Section 3.2; Table 2). This weakens a strict “perplexity ⇒ support” conclusion.
- Theory assumptions vs softmax realities.
  - The strict support theorem (Theorem C.1) depends on exact zero probabilities, which are uncommon in typical softmax LMs; the paper addresses this via the empirical support relaxation and finite-update bounds (Section 2.1; Theorem C.3 in Appendix C.1), but these are still abstractions over complex training dynamics.
7. Implications and Future Directions
- How this work changes the landscape (within the scope of the paper).
  - It reframes evaluation of RLVR from “does it improve accuracy?” to “does it expand the reachable set of correct solutions under realistic sampling budgets?”, providing concrete metrics (SRR/NDR/SDS/NSCR) and showing that, for today’s common RLVR recipes across tested models, shrinkage typically dominates expansion (Section 2.2; Table 1; Section 3.2).
  - It cautions against interpreting token-level stochasticity as evidence of global exploration, since answer-level diversity can still collapse (Section 4.2; Table 3).
- Follow-up research directions suggested by the results.
  - The paper argues that “breaking the invisible leash” likely requires explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented correct-solution regions (Abstract; Introduction; Conclusion).
  - The theoretical view of RLVR as an exponential tilt / KL projection (Appendix C.2) suggests a concrete reason: if updates largely reweight what already has mass under q, then expanding coverage needs some mechanism that changes the reachable region under finite sampling (Appendix C.1–C.2).
- Practical applications / downstream use cases (practitioner takeaway).
  - If your deployment uses single-shot generation (or very small k), RLVR-style post-training can be beneficial because it concentrates probability on high-reward outputs, improving precision (Section 4.2; Table 3 shows consistent avg@32 gains, aligning with the general “precision enhancement” story).
  - If your deployment can afford large-k sampling (many attempts) and values coverage of diverse correct solutions, the paper’s evidence suggests standard RLVR may reduce that coverage, so a base model (or a less-sharpened policy) may perform better at high k on some tasks (Section 3.2; Table 4 AIME2024/Minerva examples; Figure 4).
- Repro/Integration Guidance (when to prefer what, based on this paper’s evidence).
  - Prefer a standard RLVR model when:
    - you need higher precision under constrained sampling budgets (small k), and
    - tasks have narrow correct-answer sets where mode concentration is acceptable (Section 4.2; Table 3’s answer-entropy reductions).
  - Be cautious with standard RLVR when:
    - you rely on multiple attempts / reranking / search (k large) to recover long-tail correct solutions, because net support change can be negative (Table 1 NSCR<0 in most aggregates; Section 3.2; Figure 4).
  - If you need both precision and broad coverage, the paper motivates exploring algorithmic additions that explicitly promote exploration or seed mass into underrepresented correct modes (Abstract; Conclusion), although it does not provide an implemented method in the provided content.