Evaluating Parameter Efficient Methods for RLVR
ArXiv: 2512.23165
🎯 Pitch
This paper presents the first large-scale benchmark of 12+ parameter-efficient fine-tuning (PEFT) methods under Reinforcement Learning with Verifiable Rewards (RLVR), showing that structural LoRA variants (notably DoRA, AdaLoRA, MiSS) consistently outperform standard LoRA while SVD-based initializations can catastrophically fail. By identifying an expressivity floor for RLVR and recommending geometry-aware adapters, the work enables more reliable, compute- and memory-efficient RL training for stronger mathematical reasoning in LLMs.
1. Executive Summary (2-3 sentences)
This paper benchmarks parameter-efficient fine-tuning (PEFT) methods for RLVR (Reinforcement Learning with Verifiable Rewards) on math reasoning, asking which adapter design works best when training from sparse, verifier-provided rewards rather than dense supervised targets. Across 12+ PEFT methods on DeepSeek-R1-Distill-Qwen 1.5B and 7B, structural LoRA variants (notably DoRA, also AdaLoRA and MiSS) consistently outperform standard LoRA, while SVD-initialized methods (PiSSA, MiLoRA) can collapse badly. The results also indicate an “expressivity floor”: extremely tiny adapters (e.g., vector-only schemes like VeRA, IA3) bottleneck reasoning performance even if they are very parameter-efficient.
2. Context and Motivation
- Problem / gap addressed
    - RLVR is increasingly used to boost LLM reasoning using verifiable (rule-based) feedback, often a binary reward signal (correct/incorrect).
    - Despite many PEFT methods existing, RLVR practice often defaults to standard LoRA, and it is unclear whether LoRA is actually the best PEFT architecture under RL optimization dynamics.
- Why it matters
    - RL training is described as resource-intensive and unstable, and RLVR supervision is sparse (binary), which can imply parameter redundancy if one updates the whole model.
    - If a better PEFT design can match or exceed full fine-tuning under RLVR, it could reduce memory/compute needs while improving reasoning.
- Prior approaches and shortcomings (as positioned here)
    - Standard LoRA is commonly used for RLVR and is known to be competitive with full fine-tuning in some prior RL settings.
    - However, many LoRA variants (e.g., DoRA) have shown advantages in fine-tuning scenarios, motivating a systematic RLVR-specific comparison.
- How the paper positions itself
    - It presents the first large-scale, multi-method PEFT evaluation specifically under RLVR, aiming to produce actionable guidance on what adapter types to prefer or avoid in RLVR (Figure 2 overview; Table 1 method definitions).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is an RL training pipeline for math reasoning LLMs where only a small set of parameters (adapters) is trained while the base model is mostly frozen.
- It addresses efficient RLVR fine-tuning by comparing multiple PEFT adapter designs under the same RLVR setup (mainly DAPO-style GRPO), measuring downstream math accuracy.
3.2 Big-picture architecture (diagram in words)
- Base model (DeepSeek-R1-Distill-Qwen 1.5B or 7B)
  → Attach a PEFT method (e.g., LoRA / DoRA / AdaLoRA / MiSS / …; Table 1)
  → RLVR rollout generation (sample multiple completions per prompt using vLLM)
  → Verifier-based reward (parse \boxed{} answer; check equivalence with latex2sympy + math verify)
  → Compute GRPO/DAPO advantages from group rewards (Eq. (1))
  → Update only adapter/trainable PEFT parameters using the RL objective
  → Evaluate on math benchmarks with multi-sample metrics (Avg@k, Pass@k; Table 2).
3.3 Roadmap for the deep dive
- Explain RLVR and the specific RL objective used (GRPO, plus DAPO / Dr. GRPO) because it determines the gradient signal PEFT must learn from.
- Define the PEFT method families and what changes between them (structure vs initialization vs extreme compression) because that is the main experimental variable.
- Describe the training pipeline end-to-end (data → rollouts → reward → update) to clarify what is held constant across methods.
- Detail experimental controls (hyperparameters, targets, evaluation metrics) to interpret comparisons fairly.
- Summarize the key mechanistic analysis used to explain failures (the “spectral collapse” analysis around Figure 3).
3.4 Detailed, sentence-based technical breakdown
- Framing: This is an empirical benchmarking + mechanistic analysis paper whose core idea is to hold an RLVR training recipe fixed and systematically swap in different PEFT adapter designs to see which ones best support RL optimization from sparse rewards.
- RLVR objective and what it means
    - RLVR trains the model using a verifier that gives a binary reward R ∈ {0,1} indicating whether the final answer is correct (Section 2.4).
    - The paper builds on GRPO (Group Relative Policy Optimization), which avoids training a separate critic by estimating advantages from a group of sampled responses for the same prompt (Section 2.1).
    - In plain language, Eq. (1) says: for each prompt, sample G candidate responses, compare their likelihood under the new policy vs an old snapshot, and adjust the policy to increase the probability of responses that got higher-than-average reward within that group, while using PPO-style clipping for stability.
    - Equation (1), paraphrased before notation: The training objective increases the probability of tokens in better-than-average responses, but caps the update size so learning does not become unstable.
    - Symbols (Eq. (1))
        - q is a prompt sampled from dataset D.
        - {o_i} are the G sampled responses under the old policy π_{θ_old}.
        - o_{i,t} is token t in response i, and o_{i,<t} is its prefix.
        - The ratio π_θ / π_{θ_old} is the PPO importance-sampling ratio.
        - Â_i is a standardized advantage computed from the group rewards: Â_i = (R_i - mean({R_j})) / std({R_j}).
        - ε is the PPO-style clip threshold; under DAPO it is modified (see below).
    - DAPO modifies GRPO to improve stability in long chain-of-thought settings by:
        - Using “clip-higher” with a larger upper clipping bound (the paper cites an example ε_high = 0.28) to allow uplifting low-probability exploration tokens (Section 2.1).
        - Dynamically filtering prompts where all sampled outputs get identical rewards (all 0 or all 1) to preserve gradient signal (Section 2.1).
    - Dr. GRPO changes GRPO by removing (i) the per-response length normalization term 1/|o_i| and (ii) the group-level standard deviation in the advantage, to address biases the paper describes (Section 2.1).
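
The group-relative advantage, the DAPO-style prompt filter, and the clipped per-token term described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's training code: the function names are hypothetical, the lower clip bound of 0.2 and the population standard deviation with a small stabilizer are assumptions, and only the upper bound 0.28 comes from the text.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, use_std=True, eps=1e-6):
    """Advantages for one prompt's group of G rewards.

    use_std=True follows the GRPO form (R_i - mean) / std.
    use_std=False drops the std division, as Dr. GRPO is described to do.
    The stabilizer eps and the population std are assumptions.
    """
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not use_std:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


def has_signal(rewards):
    """DAPO-style dynamic filter: a group with identical rewards carries no gradient."""
    return len(set(rewards)) > 1


def clipped_term(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for a single token.

    eps_high=0.28 is the "clip-higher" value cited in the text;
    eps_low=0.2 is an assumed value for illustration.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Example group: G = 8 binary rewards, two correct responses.
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
if has_signal(rewards):
    advantages = group_advantages(rewards)
```

With binary rewards, correct responses in a mixed group receive positive advantages and incorrect ones negative advantages, while all-correct or all-wrong groups are skipped by the filter.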
- PEFT methods compared (what changes between methods)
    - The core baseline is standard LoRA (Section 2.2): freeze the pretrained weight W0 and learn a low-rank update ΔW = (α/r) B A.
        - Forward: h = W0 x + (α/r) B A x (Eq. (2)).
        - Initialization: A ~ N(0, σ^2), B = 0, so ΔW = 0 at step 0.
    - The paper groups PEFT methods into five categories (Figure 2; Table 1):
        - Baselines: full-parameter fine-tuning vs standard LoRA.
        - Structural variants: change the adapter structure, not just its init.
            - DoRA: decouples magnitude and direction via a normalization-like decomposition (Table 1).
            - AdaLoRA: uses an SVD-like factorization P Λ Q with adaptive rank/budget allocation (Table 1).
            - MiSS: uses a shard-sharing / sub-network style structure (Table 1 shows expand(D) with D initialized to 0).
        - Initialization strategies: keep W0 + B A, but change initialization/optimization.
            - PiSSA: initializes adapters using top singular components of W0 (Table 1).
            - MiLoRA: initializes using minor singular components of W0 (Table 1).
            - LoRA+: changes learning dynamics by using different learning rates for A vs B (Table 1 indicates η_B = λ η_A).
            - rsLoRA: uses a rank-stabilization scaling factor (Table 1).
        - Efficiency-oriented variants: reduce memory/training cost further.
            - LoRA-FA: freezes A and trains only B (Table 1).
            - VeRA: freezes random low-rank matrices and trains only scaling vectors (Table 1).
        - Other PEFT mechanisms:
            - IA3: trains multiplicative scaling vectors applied to activations (Table 1).
            - LayerNorm tuning: trains only LayerNorm gain/bias parameters (Table 1).
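
To make the LoRA forward pass and the DoRA-style magnitude/direction split concrete, here is a small NumPy sketch on toy shapes. It is illustrative only: the dimensions, the function names, and the column-wise norm convention for DoRA are assumptions, and the toy rank and alpha differ from the paper's shared setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 4, 2, 4.0  # toy sizes chosen for illustration

W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, Gaussian init
B = np.zeros((d_out, r))                     # trainable, zero init


def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B A x, the LoRA forward pass."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))


def dora_weight(W0, A, B, m, alpha, r):
    """DoRA-style weight: learned magnitude m times a unit-norm direction.

    The direction is the LoRA-updated weight normalized per column
    (one common convention; assumed here).
    """
    V = W0 + (alpha / r) * (B @ A)
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * V / col_norm


x = rng.normal(size=d_in)
m = np.linalg.norm(W0, axis=0, keepdims=True)  # magnitude initialized from W0

# Because B starts at zero, both adapters reproduce the base model at step 0.
h_lora = lora_forward(x, W0, A, B, alpha, r)
W_dora = dora_weight(W0, A, B, m, alpha, r)
```

The zero initialization of B is what guarantees ΔW = 0 at the start of training, so any early change in behaviour comes from the RL updates rather than from the adapter itself.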
- What happens first, second, third… (pipeline narrative)
    - Choose a base model: DeepSeek-R1-Distill-Qwen-1.5B or 7B (Section 2.3).
    - Attach one PEFT method and decide which parameters are trainable (Table 1), while freezing the base weights (except full fine-tuning).
    - Sample training prompts from open-r1/DAPO-Math-17k-Processed (~17.4k math queries; Section 2.3).
    - Generate rollouts: for each prompt, sample G = 8 completions using vLLM in co-location mode for throughput (Section 2.4).
    - Enforce an output format: reasoning in <think>...</think>, final answer in \boxed{} (Section 2.3), so the verifier can reliably extract the answer.
    - Compute a binary reward: parse the boxed answer, compare to ground truth with latex2sympy + math verify; reward is R=1 if equivalent else R=0 (Section 2.4).
    - Compute group-relative advantages from the G rewards and optimize the RL objective (Eq. (1)), using DAPO as the default RLVR algorithm (Section 2.1).
    - Update trainable parameters only (the adapters / scaling vectors / LayerNorm params, depending on the method), using the fixed training hyperparameters described below.
    - Evaluate the resulting model on held-out math benchmarks with multi-sample decoding and Avg@k / Pass@k metrics (Section 2.5; Table 2).
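
The reward step in this pipeline can be sketched as a brace-matching extractor followed by a comparison. This is a simplified stand-in: the paper's verifier checks mathematical equivalence with latex2sympy and math verify, whereas the hypothetical `binary_reward` below only compares whitespace-stripped strings, so it would reject answers that are equivalent but written differently.

```python
def extract_boxed(text):
    r"""Return the contents of the last \boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1          # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None        # braces never balanced


def binary_reward(completion, ground_truth):
    """R = 1 if the boxed answer matches the ground truth, else 0.

    String comparison after removing spaces is an assumption made for
    this sketch; it is weaker than symbolic equivalence checking.
    """
    answer = extract_boxed(completion)
    if answer is None:
        return 0
    strip = lambda s: s.replace(" ", "")
    return 1 if strip(answer) == strip(ground_truth) else 0


completion = "<think>half of one</think> The answer is \\boxed{\\frac{1}{2}}"
reward = binary_reward(completion, "\\frac{1}{2}")
```

Taking the last boxed expression is a design choice for the sketch: long reasoning traces may box intermediate results before the final answer.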
- Core configurations / hyperparameters (as specified)
    - Adapter targeting: apply PEFT to all linear modules {q, k, v, o, gate, up, down}_proj (Section 2.2).
    - PEFT config (shared): rank r = 32, dropout 0.05, alpha 64 for all PEFT methods (Section 2.2).
    - RLVR rollout count: G = 8 rollouts per prompt (Section 2.4).
    - Learning rate: constant 1e-5, no warmup (Section 2.4). (Additional LR ablations use 5e-6 and 1e-6; Figure 2 / Figure 4 / Table 5.)
    - Sequence lengths (training): max prompt length 512, completion length 16384 tokens (Section 2.4).
    - DAPO settings: epsilon 0.28 (“clip-higher”), and no KL coefficient (β not used) (Section 2.4).
    - Batching / steps:
        - 1.5B: per-device batch size 4, global batch size 128, 1024 steps, gradient accumulation 8 (Section 2.4).
        - 7B: per-device batch size 1, global batch size 32, 8192 steps, gradient accumulation 8 (Section 2.4).
    - Systems tooling: Accelerate with DeepSpeed ZeRO-2 (optimizer state offload) to reduce memory, and vLLM for rollout generation (Section 2.4).
    - Evaluation decoding: temperature 0.6, top-p 0.95, max tokens 32768, seed 42 (Section 2.5).
    - Optimizer / weight decay / model architecture details: The excerpt does not specify the optimizer type (e.g., AdamW settings), weight decay, or model architecture hyperparameters (layers, hidden size, heads, tokenizer), so they are not inferred here.
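
As a quick sanity check on the batching numbers above, the following sketch records the two run configurations and derives a few quantities from them. The `RunConfig` class is hypothetical, and the relation global batch = per-device batch × gradient accumulation × number of devices is an assumption; the excerpt does not state the device count or whether the global batch counts prompts or completions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    per_device_batch: int
    grad_accum: int
    global_batch: int
    steps: int

    def implied_devices(self):
        # Assumes global_batch = per_device_batch * grad_accum * num_devices.
        return self.global_batch // (self.per_device_batch * self.grad_accum)

    def total_samples(self):
        # Samples consumed over the whole run, in global-batch units.
        return self.global_batch * self.steps


# Values taken from the hyperparameter list (Section 2.4).
RUN_1_5B = RunConfig(per_device_batch=4, grad_accum=8, global_batch=128, steps=1024)
RUN_7B = RunConfig(per_device_batch=1, grad_accum=8, global_batch=32, steps=8192)

# Shared adapter settings (Section 2.2) and the resulting LoRA scale alpha / r.
LORA_RANK, LORA_ALPHA, LORA_DROPOUT = 32, 64, 0.05
LORA_SCALE = LORA_ALPHA / LORA_RANK

for name, run in [("1.5B", RUN_1_5B), ("7B", RUN_7B)]:
    print(name, run.implied_devices(), run.total_samples())
```

Under that assumption, the two runs differ in the total number of samples seen, which is worth keeping in mind when comparing the 1.5B and 7B results.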
- Mechanistic analysis method: spectral “collapse/misalignment”
    - To explain why SVD-based initializations fail, the paper inspects how weight updates distribute across singular value components (Figure 3 left) and plots the cumulative energy explained by top components (Figure 3 center), alongside RL training curves (Figure 3 right).
    - The paper’s key interpretation is that PiSSA fails because it forces updates into principal components, which conflicts with the RLVR update tendencies they cite, while MiLoRA fails because its “minor component” initialization has negligible magnitude, so optimization quickly reorients updates back toward dominant principal directions (Section 3.1; Eq. (3); Figure 3).
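
The two ingredients of this analysis, a cumulative-energy curve over singular values and an adapter initialized from principal or minor singular components, can be sketched with NumPy on a random toy matrix. The function names and the symmetric square-root split of the singular values between B and A are assumptions for illustration; this is not the paper's analysis code and the toy matrix is not a real model weight.

```python
import numpy as np


def cumulative_energy(delta_w):
    """Fraction of squared singular-value mass held by the top-k components, for each k."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    energy = s ** 2
    return np.cumsum(energy) / energy.sum()


def svd_init(W0, r, which="principal"):
    """Build B, A from the top-r ("principal") or bottom-r ("minor") singular triplets.

    "principal" mirrors the PiSSA idea and "minor" the MiLoRA idea as
    described in the text; details of either method may differ.
    """
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    idx = slice(0, r) if which == "principal" else slice(len(s) - r, len(s))
    root = np.sqrt(s[idx])
    B = U[:, idx] * root          # shape (d_out, r)
    A = root[:, None] * Vt[idx]   # shape (r, d_in)
    return B, A


rng = np.random.default_rng(0)
W0 = rng.normal(size=(16, 12))    # toy stand-in for a pretrained weight

B_p, A_p = svd_init(W0, r=2, which="principal")
B_m, A_m = svd_init(W0, r=2, which="minor")

norm_principal = np.linalg.norm(B_p @ A_p)
norm_minor = np.linalg.norm(B_m @ A_m)
curve = cumulative_energy(B_p @ A_p)
```

On this toy matrix the minor-component product has a smaller norm than the principal-component product, which illustrates the kind of magnitude gap the paper points to when explaining why a minor-component initialization can be overwhelmed during training.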
4. Key Insights and Innovations
- (1) Structural PEFT variants are better aligned with RLVR than standard LoRA
    - What is new here: A broad controlled comparison shows that changing adapter structure (not just rank or init) can consistently beat standard LoRA under RLVR.
    - Why it matters: Under the 1.5B setting, DoRA achieves the highest reported overall average accuracy and even exceeds full fine-tuning in this benchmark suite (Table 3).
- (2) SVD-informed initialization can catastrophically fail in RLVR (“spectral collapse / misalignment”)
    - What is new here: The paper reports severe collapse for PiSSA and major degradation for MiLoRA under RLVR (Table 3), then uses spectral plots (Figure 3) to argue the failure is tied to mismatch between SVD-biased update directions and RL optimization dynamics.
    - Why it matters: These methods are often motivated as “better initialization” for fine-tuning, but this evaluation suggests they may be actively harmful under verifier-reward RL.
- (3) An “expressivity floor”: extreme parameter reduction degrades reasoning
    - What is new here: The paper identifies a boundary where adapter capacity becomes too small to support RLVR-driven reasoning improvements.
    - Why it matters: Methods that only train tiny vector scalings (VeRA, IA3, sometimes LayerNorm-only tuning) can be much less effective than modest low-rank adapters, even if RLVR is sparse (Table 3; Table 4).
- (4) Robustness checks across RLVR algorithms, rank, batch size, and LR
    - What is new here: Ablations (Figure 4; Table 5) suggest the high-level method ranking is not overly sensitive to swapping among GRPO, DAPO, and Dr. GRPO, and that moderate-to-higher ranks (e.g., 16, 32) are better than extreme low rank (1) for RLVR.
5. Experimental Analysis
- Evaluation methodology
    - Training data: open-r1/DAPO-Math-17k-Processed with ~17.4k math queries (Section 2.3).
    - Reward: strict binary outcome reward via math equivalence verification (Section 2.4).
    - Benchmarks: AIME24 (30), AIME25 (30), AMC (40), HMMT (30), MATH500 (500), Minerva (272) (Table 2).
    - Metrics:
        - Avg@k: average accuracy across k generations per problem (Section 2.5; Table 2 uses k=32 for AIME/AMC/HMMT and k=4 for MATH500/Minerva).
        - Pass@k: whether at least one of the k generations is correct (Section 2.5; tables report both “Avg.” and “Pass” columns).
    - Controlled conditions: unified adapter targeting and shared LoRA hyperparameters (rank 32, dropout 0.05, alpha 64) across PEFT methods (Section 2.2), plus unified RLVR recipe (Section 2.4).
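
The two metrics can be computed directly from a table of per-sample correctness flags. The sketch below follows the definitions given above (mean accuracy over the k samples, and "at least one of k correct"); the function names and the toy data are hypothetical, and this is the plain empirical form rather than any unbiased Pass@k estimator.

```python
def avg_at_k(correct):
    """Avg@k: per-problem mean accuracy over its k samples, averaged over problems.

    correct[p][j] is True when sample j for problem p was judged correct.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)


def pass_at_k(correct):
    """Pass@k: fraction of problems with at least one correct sample among the k drawn."""
    return sum(1 for row in correct if any(row)) / len(correct)


# Toy example: 2 problems, k = 4 samples each.
toy = [
    [True, False, False, False],   # solved once in four tries
    [False, False, False, False],  # never solved
]
avg = avg_at_k(toy)
passed = pass_at_k(toy)
```

The gap between the two numbers on the same data shows why the tables report both: Pass@k rewards occasional success, while Avg@k reflects how reliably a problem is solved.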
- Main quantitative results (DeepSeek-R1-Distill-Qwen-1.5B)
    - Overall average accuracy (Table 3):
        - Base: 40.5
        - Full fine-tuning: 44.9
        - LoRA: 42.5
        - Structural:
            - DoRA: 46.6 (highest)
            - AdaLoRA: 44.2
            - MiSS: 43.4
        - Efficiency:
            - LoRA-FA: 43.0
            - VeRA: 40.7
        - Initialization:
            - LoRA+: 43.9
            - rsLoRA: 42.3
            - MiLoRA: 18.0
            - PiSSA: 0.2
        - Other:
            - IA3: 22.3
            - LN Tuning: 41.8
    - Notable benchmark-specific highlights (Table 3):
        - DoRA shows strong gains on AIME24 Avg (39.0) vs LoRA (33.2) and Full (34.9).
        - PiSSA is near-zero across benchmarks (e.g., AIME24 Avg 0.0, overall 0.2), consistent with collapse.
        - MiLoRA is severely degraded (overall 18.0), much worse than LoRA.
- Expressivity vs trainable parameter fraction
    - Table 4 links performance to trainable parameter percentage:
        - Full: 100% → 44.9
        - LoRA: 1.55% → 42.5
        - MiSS: 0.99% → 43.4
        - LoRA rank 1: 0.0015% → 40.5
        - VeRA: 0.0029% → 40.7
        - LN Tuning: 0.0035% → 41.8
    - This supports the paper’s claim that moderate parameter reduction can work, but extreme reduction tends to cap reasoning gains.
- Training dynamics
    - Figure 1 (right) plots “accuracy reward” over training steps and shows different convergence behaviors across methods.
    - Figure 3 (right) specifically shows MiLoRA and PiSSA collapsing compared to LoRA and Full.
- Ablation results (LoRA-focused robustness checks)
    - Table 5 varies batch size, learning rate, rank, and RLVR algorithm.
    - Batch size (Table 5; Figure 5):
        - Bsz 32: overall 43.0 vs the baseline LoRA 42.5 (baseline uses global batch 128).
        - The paper notes the “small batch heuristic” from SFT is weaker here, and larger batch can help on AIME24.
    - Learning rate (Table 5):
        - 1e-5: overall 42.3
        - 5e-6: overall 42.3 (same overall in the table, with different per-benchmark numbers)
    - Rank (Table 5):
        - r=1: overall 40.5 (worse)
        - r=8: overall 42.3
        - r=16: overall 43.9 (best among shown ranks)
    - RLVR algorithm (Table 5):
        - Dr. GRPO: overall 42.0
        - GRPO: overall 40.5
        - The paper interprets differences as not statistically significant overall, emphasizing “algorithmic invariance,” though the excerpt does not include standard deviations/confidence intervals (it says “std dev removed for clarity”).
- Scaling to 7B
    - Table 6 (DeepSeek-R1-Distill-Qwen-7B) reports overall average accuracies:
        - LoRA: 54.8
        - DoRA: 55.0
        - MiSS: 53.4
        - LoRA+: 55.5 (highest in Table 6)
    - The relative ordering largely persists: DoRA / LoRA+ slightly exceed LoRA, suggesting the “structural/optimization-aware” benefits are not confined to 1.5B.
- Do experiments support the claims?
    - The tables clearly support:
        - Structural variants (DoRA, AdaLoRA, MiSS) ≥ LoRA on overall average in the 1.5B suite (Table 3).
        - SVD-based inits (PiSSA, MiLoRA) are dramatically worse (Table 3) with collapse-like curves (Figure 3 right).
        - Extreme compression methods (notably IA3) can be far worse (Table 3), consistent with an expressivity bottleneck framing.
    - What is less fully supported in the excerpt:
        - The mechanistic “off-principal” explanation is plausible within their spectral plots (Figure 3), but depends on broader theoretical claims (they cite Zhu et al., 2025) that are not derived in this excerpt. The paper’s own Figure 3 provides correlational evidence (update spectra + collapse), not a proof.
6. Limitations and Trade-offs
- Model and domain scope
    - Only two base models are evaluated: DeepSeek-R1-Distill-Qwen 1.5B and 7B (Section 2.3).
    - Tasks are exclusively mathematical reasoning benchmarks (Table 2), so it is unclear how conclusions transfer to non-math domains (coding, instruction following, dialog, etc.).
- RLVR recipe specificity
    - Default training uses DAPO with ε = 0.28 and no KL term (β absent), plus very long completion length (16384 training; 32768 eval). Different RLVR stabilization choices (e.g., KL-regularization) might change adapter behavior, but are not explored beyond algorithm ablations.
- Training horizon differences
    - The 1.5B runs use 1024 steps while the 7B runs use 8192 steps (Section 2.4). This makes “scaling” comparisons informative but not perfectly controlled for step budget.
- Incomplete reporting for full reproducibility (from the provided excerpt)
    - The excerpt does not specify optimizer type, optimizer hyperparameters, or full system hardware/compute budget (GPU type, total FLOPs/PF-days). This constrains how precisely one can replicate or normalize results.
- Mechanistic claims vs evidence
    - The “spectral misalignment” explanation uses spectral plots (Figure 3) and a qualitative gradient-flow argument (Eq. (3)), but the excerpt does not include alternative hypotheses tests (e.g., numerically fixing MiLoRA init magnitude to see if collapse disappears), so causality is suggestive rather than definitive.
7. Implications and Future Directions
- How this changes practice
    - For RLVR on math reasoning in this setup, standard LoRA is not the best default: structural methods like DoRA (and also AdaLoRA, MiSS) provide a better accuracy–parameter trade-off (Table 3; Figure 1).
    - LoRA+ appears robust and scales well to 7B (Table 6), suggesting that optimization-dynamics tweaks (learning rate ratios) can matter as much as structural changes.
- What follow-up research it enables
    - The paper’s future work section proposes:
        - Migrating to higher-performance RL infrastructure (from TRL to something like “VeRL”) to scale experiments further (Section 4).
        - Deeper mechanistic interpretability of adapter dynamics, especially the spectral evolution arguments behind why DoRA works and why SVD-based init collapses (Section 4).
        - Testing broader environments: multimodal, multi-turn, asynchronous RL, and deployment issues like weight merging stability (Section 4).
- Practical applications / downstream use
    - Efficient RLVR training for reasoning-centric assistants where correctness is verifiable (e.g., math solvers) benefits from picking adapters that preserve enough “plasticity” while staying parameter-efficient.
- Repro/Integration Guidance (when to prefer what, based on this paper’s evidence)
    - If you are doing RLVR with verifiable binary rewards on reasoning tasks similar to these benchmarks:
        - Prefer DoRA as a strong default when you want maximum accuracy among PEFTs; on 1.5B it even exceeds full fine-tuning in overall average (46.6 vs 44.9, Table 3).
        - Consider AdaLoRA or MiSS if you want structural variants that beat LoRA but may have different parameter fractions (MiSS is 0.99% trainable in Table 4).
        - Consider LoRA+ as an optimization-stable variant; it is competitive at 1.5B (43.9, Table 3) and best among the listed methods at 7B (55.5, Table 6).
        - Avoid SVD-based init methods like PiSSA and MiLoRA under this RLVR recipe due to severe collapse (0.2 and 18.0, Table 3; Figure 3).
        - Avoid extreme compression (e.g., IA3 at 22.3, Table 3) if your goal is reasoning improvement; the paper’s data suggests you need a minimum adapter capacity (“expressivity floor,” Table 4).
    - If you must reduce parameters aggressively for hardware reasons, the results suggest moderate reductions like LoRA-FA can remain competitive (43.0, Table 3) compared to methods that only train scaling vectors.