Evaluating Parameter Efficient Methods for RLVR

ArXiv: 2512.23165

🎯 Pitch

This paper presents the first large-scale benchmark of 12+ parameter-efficient fine-tuning (PEFT) methods under Reinforcement Learning with Verifiable Rewards (RLVR), showing that structural LoRA variants (notably DoRA, AdaLoRA, MiSS) consistently outperform standard LoRA while SVD-based initializations can catastrophically fail. By identifying an expressivity floor for RLVR and recommending geometry-aware adapters, the work enables more reliable, compute- and memory-efficient RL training for stronger mathematical reasoning in LLMs.


1. Executive Summary

This paper benchmarks parameter-efficient fine-tuning (PEFT) methods for RLVR (Reinforcement Learning with Verifiable Rewards) on math reasoning, asking which adapter design works best when training from sparse, verifier-provided rewards rather than dense supervised targets. Across 12+ PEFT methods on DeepSeek-R1-Distill-Qwen 1.5B and 7B, structural LoRA variants (notably DoRA, also AdaLoRA and MiSS) consistently outperform standard LoRA, while SVD-initialized methods (PiSSA, MiLoRA) can collapse badly. The results also indicate an “expressivity floor”: extremely tiny adapters (e.g., vector-only schemes like VeRA, IA3) bottleneck reasoning performance even if they are very parameter-efficient.

2. Context and Motivation

  • Problem / gap addressed
  • RLVR is increasingly used to boost LLM reasoning using verifiable (rule-based) feedback, often a binary reward signal (correct/incorrect).
  • Despite many PEFT methods existing, RLVR practice often defaults to standard LoRA, and it is unclear whether LoRA is actually the best PEFT architecture under RL optimization dynamics.

  • Why it matters

  • RL training is described as resource-intensive and unstable, and RLVR supervision is sparse (binary), which can imply parameter redundancy if one updates the whole model.
  • If a better PEFT design can match or exceed full fine-tuning under RLVR, it could reduce memory/compute needs while improving reasoning.

  • Prior approaches and shortcomings (as positioned here)

  • Standard LoRA is commonly used for RLVR and is known to be competitive with full fine-tuning in some prior RL settings.
  • However, many LoRA variants (e.g., DoRA) have shown advantages in fine-tuning scenarios, motivating a systematic RLVR-specific comparison.

  • How the paper positions itself

  • It presents the first large-scale, multi-method PEFT evaluation specifically under RLVR, aiming to produce actionable guidance on what adapter types to prefer or avoid in RLVR (Figure 2 overview; Table 1 method definitions).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is an RL training pipeline for math reasoning LLMs where only a small set of parameters (adapters) is trained while the base model is mostly frozen.
  • It addresses efficient RLVR fine-tuning by comparing multiple PEFT adapter designs under one shared RLVR setup (mainly DAPO-style GRPO) and measuring downstream math accuracy.

3.2 Big-picture architecture (diagram in words)

  • Base model (DeepSeek-R1-Distill-Qwen 1.5B or 7B)
    Attach a PEFT method (e.g., LoRA / DoRA / AdaLoRA / MiSS / …; Table 1)
    RLVR rollout generation (sample multiple completions per prompt using vLLM)
    Verifier-based reward (parse \boxed{} answer; check equivalence with latex2sympy + math verify)
    Compute GRPO/DAPO advantages from group rewards (Eq. (1))
    Update only adapter/trainable PEFT parameters using the RL objective
    Evaluate on math benchmarks with multi-sample metrics (Avg@k, Pass@k; Table 2).

3.3 Roadmap for the deep dive

  • Explain RLVR and the specific RL objective used (GRPO, plus DAPO / Dr. GRPO) because it determines the gradient signal PEFT must learn from.
  • Define the PEFT method families and what changes between them (structure vs initialization vs extreme compression) because that is the main experimental variable.
  • Describe the training pipeline end-to-end (data → rollouts → reward → update) to clarify what is held constant across methods.
  • Detail experimental controls (hyperparameters, targets, evaluation metrics) to interpret comparisons fairly.
  • Summarize the key mechanistic analysis used to explain failures (the “spectral collapse” analysis around Figure 3).

3.4 Detailed, sentence-based technical breakdown

  • Framing: This is an empirical benchmarking + mechanistic analysis paper whose core idea is to hold an RLVR training recipe fixed and systematically swap in different PEFT adapter designs to see which ones best support RL optimization from sparse rewards.

  • RLVR objective and what it means

  • RLVR trains the model using a verifier that gives a binary reward R ∈ {0,1} indicating whether the final answer is correct (Section 2.4).
  • The paper builds on GRPO (Group Relative Policy Optimization), which avoids training a separate critic by estimating advantages from a group of sampled responses for the same prompt (Section 2.1).
  • In plain language, Eq. (1) says: for each prompt, sample G candidate responses, compare their likelihood under the new policy vs an old snapshot, and adjust the policy to increase the probability of responses that got higher-than-average reward within that group, while using PPO-style clipping for stability.
  • Equation (1), paraphrased before notation: The training objective increases the probability of tokens in better-than-average responses, but caps the update size so learning does not become unstable.
  • Symbols (Eq. (1))
    • q is a prompt sampled from dataset D.
    • {o_i} are the G sampled responses under the old policy π_{θ_old}.
    • o_{i,t} is token t in response i, and o_{i,<t} is its prefix.
    • The ratio π_θ / π_{θ_old} is the PPO importance-sampling ratio.
    • Â_i is a standardized advantage computed from the group rewards:
      Â_i = (R_i - mean({R_j})) / std({R_j}).
    • ε is the PPO-style clip threshold; under DAPO it is modified (see below).
  • DAPO modifies GRPO to improve stability in long chain-of-thought settings by:
    • Using “clip-higher” with a larger upper clipping bound (the paper cites an example ε_high = 0.28) to allow uplifting low-probability exploration tokens (Section 2.1).
    • Dynamically filtering prompts where all sampled outputs get identical rewards (all 0 or all 1) to preserve gradient signal (Section 2.1).
  • Dr. GRPO changes GRPO by removing (i) the per-response length normalization term 1/|o_i| and (ii) the group-level standard deviation in the advantage, to address biases the paper describes (Section 2.1).
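
The advantage and clipping rules described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the population standard deviation, the small `eps` added to the denominator, and the lower clip bound of 0.2 are all assumptions (the summary only states the upper bound of 0.28).

```python
import statistics


def group_advantages(rewards, use_std=True, eps=1e-6):
    """Group-relative advantages for the G rewards of one prompt.

    use_std=True  -> GRPO/DAPO style: (R_i - mean) / std
    use_std=False -> Dr. GRPO style: R_i - mean (no std division)
    The eps term and the population std are assumptions of this sketch.
    """
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not use_std:
        return centered
    std = statistics.pstdev(rewards)
    return [c / (std + eps) for c in centered]


def keep_prompt(rewards):
    """DAPO-style dynamic filter: drop groups whose rewards are all identical."""
    return len(set(rewards)) > 1


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for a single token.

    eps_high=0.28 follows the "clip-higher" value quoted above;
    eps_low=0.2 is an assumed placeholder.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


rewards = [1, 0, 0, 0, 1, 0, 0, 0]  # G = 8 binary verifier rewards
advantages = group_advantages(rewards)
```

With binary rewards, a group in which every response is correct (or every response is wrong) has zero centered reward everywhere, which is why the filter in `keep_prompt` preserves gradient signal.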

  • PEFT methods compared (what changes between methods)

  • The core baseline is standard LoRA (Section 2.2): freeze the pretrained weight W0 and learn a low-rank update ΔW = (α/r) B A.
    • Forward: h = W0 x + (α/r) B A x (Eq. (2)).
    • Initialization: A ~ N(0, σ^2), B = 0, so ΔW = 0 at step 0.
  • The paper groups PEFT methods into five categories (Figure 2; Table 1):

    • Baselines: full-parameter fine-tuning vs standard LoRA.
    • Structural variants: change the adapter structure, not just its init.
      • DoRA: decouples magnitude and direction via a normalization-like decomposition (Table 1).
      • AdaLoRA: uses an SVD-like factorization P Λ Q with adaptive rank/budget allocation (Table 1).
      • MiSS: uses a shard-sharing / sub-network style structure (Table 1 shows expand(D) with D initialized to 0).
    • Initialization strategies: keep W0 + B A, but change initialization/optimization.
      • PiSSA: initializes adapters using top singular components of W0 (Table 1).
      • MiLoRA: initializes using minor singular components of W0 (Table 1).
      • LoRA+: changes learning dynamics by using different learning rates for A vs B (Table 1 indicates η_B = λ η_A).
      • rsLoRA: uses a rank-stabilization scaling factor (Table 1).
    • Efficiency-oriented variants: reduce memory/training cost further.
      • LoRA-FA: freezes A and trains only B (Table 1).
      • VeRA: freezes random low-rank matrices and trains only scaling vectors (Table 1).
    • Other PEFT mechanisms:
      • IA3: trains multiplicative scaling vectors applied to activations (Table 1).
      • LayerNorm tuning: trains only LayerNorm gain/bias parameters (Table 1).
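
As a concrete reference point for the variants listed above, here is a small pure-Python sketch of the standard LoRA forward pass from Eq. (2) with the zero initialization of B. The helper names, the list-of-lists matrix layout, and the value of `sigma` are assumptions made for illustration; real implementations use tensor libraries.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def init_lora(d_out, d_in, r, sigma=0.02, seed=0):
    """A ~ N(0, sigma^2) with shape (r, d_in); B = 0 with shape (d_out, r).

    sigma and seed are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, sigma) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x), with W0 kept frozen."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Tiny example: 2x3 frozen weight, rank-2 adapter.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
x = [1.0, 2.0, 3.0]
A, B = init_lora(d_out=2, d_in=3, r=2)
h0 = lora_forward(W0, A, B, x, alpha=4, r=2)
```

Because B starts at zero, `h0` equals the frozen model's output `W0 x`, so training begins from the base policy. The SVD-based initializations (PiSSA, MiLoRA) change this starting point, which is where the variants in the list begin to diverge.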
  • What happens first, second, third… (pipeline narrative)

  • Choose a base model: DeepSeek-R1-Distill-Qwen-1.5B or 7B (Section 2.3).
  • Attach one PEFT method and decide which parameters are trainable (Table 1), while freezing the base weights (except full fine-tuning).
  • Sample training prompts from open-r1/DAPO-Math-17k-Processed (~17.4k math queries; Section 2.3).
  • Generate rollouts: for each prompt, sample G = 8 completions (Section 2.4) using vLLM in co-location mode for throughput (Section 2.4).
  • Enforce an output format: reasoning in <think>...</think>, final answer in \boxed{} (Section 2.3), so the verifier can reliably extract the answer.
  • Compute a binary reward: parse the boxed answer, compare to ground truth with latex2sympy + math verify; reward is R=1 if equivalent else R=0 (Section 2.4).
  • Compute group-relative advantages from the G rewards and optimize the RL objective (Eq. (1)), using DAPO as the default RLVR algorithm (Section 2.1).
  • Update trainable parameters only (the adapters / scaling vectors / LayerNorm params, depending on the method), using the fixed training hyperparameters described below.
  • Evaluate the resulting model on held-out math benchmarks with multi-sample decoding and Avg@k / Pass@k metrics (Section 2.5; Table 2).
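
The reward step in the pipeline above can be sketched as follows. This is a simplified stand-in: the paper's verifier uses latex2sympy and math verify for symbolic equivalence, whereas this sketch only compares whitespace-normalized strings. The function names and the normalization rule are assumptions.

```python
import re


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(answer):
    """Crude normalization (assumption): strip whitespace and a trailing period."""
    return re.sub(r"\s+", "", answer).rstrip(".")


def binary_reward(completion, ground_truth):
    """R = 1 if the boxed answer matches the reference, else R = 0."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0
    return 1 if normalize(answer) == normalize(ground_truth) else 0


completion = "<think>2 + 2 = 4</think> The answer is \\boxed{4}"
reward = binary_reward(completion, "4")
```

A string comparison would score `\boxed{0.5}` against a reference of `\frac{1}{2}` as wrong, which is the kind of case a symbolic equivalence check is meant to handle.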

  • Core configurations / hyperparameters (as specified)

  • Adapter targeting: apply PEFT to all linear projection modules {q, k, v, o, gate, up, down}_proj (Section 2.2).
  • PEFT config (shared): rank r = 32, dropout 0.05, alpha 64 for all PEFT methods (Section 2.2).
  • RLVR rollout count: G = 8 rollouts per prompt (Section 2.4).
  • Learning rate: constant 1e-5, no warmup (Section 2.4). (Additional LR ablations use 5e-6 and 1e-6; Figure 2 / Figure 4 / Table 5.)
  • Sequence lengths (training): max prompt length 512, completion length 16384 tokens (Section 2.4).
  • DAPO settings: epsilon 0.28 (“clip-higher”), and no KL coefficient (β not used) (Section 2.4).
  • Batching / steps:
    • 1.5B: per-device batch size 4, global batch size 128, 1024 steps, gradient accumulation 8 (Section 2.4).
    • 7B: per-device batch size 1, global batch size 32, 8192 steps, gradient accumulation 8 (Section 2.4).
  • Systems tooling: Accelerate with DeepSpeed ZeRO-2 (optimizer state offload) to reduce memory, and vLLM for rollout generation (Section 2.4).
  • Evaluation decoding: temperature 0.6, top-p 0.95, max tokens 32768, seed 42 (Section 2.5).
  • Optimizer / weight decay / model architecture details: The excerpt does not specify the optimizer type (e.g., AdamW settings), weight decay, or model architecture hyperparameters (layers, hidden size, heads, tokenizer), so they are left unstated here rather than inferred.
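
For quick reference, the settings listed above can be collected into one configuration object. This is a hypothetical container, not the paper's config file: the class and field names are invented here, and `implied_devices` assumes the global batch equals per-device batch × gradient accumulation × device count, a relation the summary does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the hyperparameters quoted above."""
    model: str
    per_device_batch: int
    grad_accum: int
    global_batch: int
    steps: int
    rank: int = 32
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    learning_rate: float = 1e-5
    num_rollouts: int = 8
    max_prompt_len: int = 512
    max_completion_len: int = 16384
    clip_high: float = 0.28

    def lora_scale(self):
        """The alpha / r factor applied to the low-rank update."""
        return self.lora_alpha / self.rank

    def implied_devices(self):
        """Device count under the assumed batch-size relation."""
        return self.global_batch // (self.per_device_batch * self.grad_accum)


cfg_1p5b = RunConfig("DeepSeek-R1-Distill-Qwen-1.5B", 4, 8, 128, 1024)
cfg_7b = RunConfig("DeepSeek-R1-Distill-Qwen-7B", 1, 8, 32, 8192)
```

With rank 32 and alpha 64, the update is scaled by 2.0 for every LoRA-style method. Under the assumed batch relation, both configurations would be consistent with the same device count, but the excerpt does not confirm the hardware setup.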

  • Mechanistic analysis method: spectral “collapse/misalignment”

  • To explain why SVD-based initializations fail, the paper inspects how weight updates distribute across singular value components (Figure 3 left) and plots the cumulative energy explained by top components (Figure 3 center), alongside RL training curves (Figure 3 right).
  • The paper’s key interpretation is that PiSSA fails because it forces updates into principal components, which conflicts with the RLVR update tendencies they cite, while MiLoRA fails because its “minor component” initialization has negligible magnitude, so optimization quickly reorients updates back toward dominant principal directions (Section 3.1; Eq. (3); Figure 3).
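
The "cumulative energy explained by top components" curve can be computed from the singular values of a weight update. The sketch below assumes energy means squared singular values (the Frobenius-norm convention); the summary does not state the paper's exact definition, and the function name is hypothetical.

```python
def cumulative_energy(singular_values):
    """Fraction of total energy captured by the top-k components, for each k.

    Energy is taken as the squared singular value (an assumption of this sketch).
    """
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return [0.0] * len(squared)
    curve = []
    running = 0.0
    for value in squared:
        running += value
        curve.append(running / total)
    return curve


# Update concentrated in one direction vs spread over several.
concentrated = cumulative_energy([10.0, 0.1, 0.1, 0.1])
spread = cumulative_energy([1.0, 1.0, 1.0, 1.0])
```

A curve that rises to nearly 1 after the first component indicates an update dominated by one direction, while a flatter curve indicates energy spread across many components.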

4. Key Insights and Innovations

  • (1) Structural PEFT variants are better aligned with RLVR than standard LoRA
  • What is new here: A broad controlled comparison shows that changing adapter structure (not just rank or init) can consistently beat standard LoRA under RLVR.
  • Why it matters: Under the 1.5B setting, DoRA achieves the highest reported overall average accuracy and even exceeds full fine-tuning in this benchmark suite (Table 3).

  • (2) SVD-informed initialization can catastrophically fail in RLVR (“spectral collapse / misalignment”)

  • What is new here: The paper reports severe collapse for PiSSA and major degradation for MiLoRA under RLVR (Table 3), then uses spectral plots (Figure 3) to argue the failure is tied to mismatch between SVD-biased update directions and RL optimization dynamics.
  • Why it matters: These methods are often motivated as “better initialization” for fine-tuning, but this evaluation suggests they may be actively harmful under verifier-reward RL.

  • (3) An “expressivity floor”: extreme parameter reduction degrades reasoning

  • What is new here: The paper identifies a boundary where adapter capacity becomes too small to support RLVR-driven reasoning improvements.
  • Why it matters: Methods that only train tiny vector scalings (VeRA, IA3, sometimes LayerNorm-only tuning) can be much less effective than modest low-rank adapters, even if RLVR is sparse (Table 3; Table 4).

  • (4) Robustness checks across RLVR algorithms, rank, batch size, and LR

  • What is new here: Ablations (Figure 4; Table 5) suggest the high-level method ranking is not overly sensitive to swapping among GRPO, DAPO, and Dr. GRPO, and that moderate-to-higher ranks (e.g., 16, 32) are better than extreme low rank (1) for RLVR.

5. Experimental Analysis

  • Evaluation methodology
  • Training data: open-r1/DAPO-Math-17k-Processed with ~17.4k math queries (Section 2.3).
  • Reward: strict binary outcome reward via math equivalence verification (Section 2.4).
  • Benchmarks: AIME24 (30), AIME25 (30), AMC (40), HMMT (30), MATH500 (500), Minerva (272) (Table 2).
  • Metrics:
    • Avg@k: average accuracy across k generations per problem (Section 2.5; Table 2 uses k=32 for AIME/AMC/HMMT and k=4 for MATH500/Minerva).
    • Pass@k: whether at least one of the k generations is correct (Section 2.5; tables report both “Avg.” and “Pass” columns).
  • Controlled conditions: unified adapter targeting and shared LoRA hyperparameters (rank 32, dropout 0.05, alpha 64) across PEFT methods (Section 2.2), plus unified RLVR recipe (Section 2.4).
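
The two metrics defined above can be illustrated with a short sketch. The input layout (one list of k booleans per problem) and the percentage scaling are assumptions chosen to match how the tables report scores.

```python
def avg_at_k(results):
    """Mean per-problem accuracy over k generations, as a percentage.

    results: list of per-problem lists, each holding k correctness flags.
    """
    per_problem = [sum(flags) / len(flags) for flags in results]
    return 100.0 * sum(per_problem) / len(per_problem)


def pass_at_k(results):
    """Percentage of problems with at least one correct generation among k."""
    solved = sum(1 for flags in results if any(flags))
    return 100.0 * solved / len(results)


# Two problems, k = 4 generations each.
results = [
    [True, False, False, False],
    [False, False, False, False],
]
avg = avg_at_k(results)
passed = pass_at_k(results)
```

Here Avg@4 is 12.5 while Pass@4 is 50.0: Pass@k rewards any single success, whereas Avg@k reflects how reliably the model reaches the correct answer.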

  • Main quantitative results (DeepSeek-R1-Distill-Qwen-1.5B)

  • Overall average accuracy (Table 3):
    • Base: 40.5
    • Full fine-tuning: 44.9
    • LoRA: 42.5
    • Structural:
      • DoRA: 46.6 (highest)
      • AdaLoRA: 44.2
      • MiSS: 43.4
    • Efficiency:
      • LoRA-FA: 43.0
      • VeRA: 40.7
    • Initialization:
      • LoRA+: 43.9
      • rsLoRA: 42.3
      • MiLoRA: 18.0
      • PiSSA: 0.2
    • Other:
      • IA3: 22.3
      • LN Tuning: 41.8
  • Notable benchmark-specific highlights (Table 3):

    • DoRA shows strong gains on AIME24 Avg (39.0) vs LoRA (33.2) and Full (34.9).
    • PiSSA is near-zero across benchmarks (e.g., AIME24 Avg 0.0, overall 0.2), consistent with collapse.
    • MiLoRA is severely degraded (overall 18.0), much worse than LoRA.
  • Expressivity vs trainable parameter fraction

  • Table 4 links performance to trainable parameter percentage:
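    • Full: 100% trainable, overall 44.9
    • LoRA: 1.55% trainable, overall 42.5
    • MiSS: 0.99% trainable, overall 43.4
    • LoRA rank 1: 0.0015% trainable, overall 40.5
    • VeRA: 0.0029% trainable, overall 40.7
    • LN Tuning: 0.0035% trainable, overall 41.8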
  • This supports the paper’s claim that moderate parameter reduction can work, but extreme reduction tends to cap reasoning gains.

  • Training dynamics

  • Figure 1 (right) plots “accuracy reward” over training steps and shows different convergence behaviors across methods.
  • Figure 3 (right) specifically shows MiLoRA and PiSSA collapsing compared to LoRA and Full.

  • Ablation results (LoRA-focused robustness checks)

  • Table 5 varies batch size, learning rate, rank, and RLVR algorithm.
  • Batch size (Table 5; Figure 5):
    • Bsz 32: overall 43.0 vs the baseline LoRA 42.5 (baseline uses global batch 128).
    • The paper notes the “small batch heuristic” from SFT is weaker here, and larger batch can help on AIME24.
  • Learning rate (Table 5):
    • 1e-5: overall 42.3
    • 5e-6: overall 42.3 (same overall in the table, with different per-benchmark numbers)
  • Rank (Table 5):
    • r=1: overall 40.5 (worse)
    • r=8: overall 42.3
    • r=16: overall 43.9 (best among shown ranks)
  • RLVR algorithm (Table 5):

    • Dr. GRPO: overall 42.0
    • GRPO: overall 40.5
    • The paper interprets differences as not statistically significant overall, emphasizing “algorithmic invariance,” though the excerpt does not include standard deviations/confidence intervals (it says “std dev removed for clarity”).
  • Scaling to 7B

  • Table 6 (DeepSeek-R1-Distill-Qwen-7B) reports overall average accuracies:
    • LoRA: 54.8
    • DoRA: 55.0
    • MiSS: 53.4
    • LoRA+: 55.5 (highest in Table 6)
  • The relative ordering largely persists: DoRA/LoRA+ slightly exceed LoRA, suggesting the “structural/optimization-aware” benefits are not confined to 1.5B.

  • Do experiments support the claims?

  • The tables clearly support:
    • Structural variants (DoRA, AdaLoRA, MiSS) ≥ LoRA on overall average in the 1.5B suite (Table 3).
    • SVD-based inits (PiSSA, MiLoRA) are dramatically worse (Table 3) with collapse-like curves (Figure 3 right).
    • Extreme compression methods (notably IA3) can be far worse (Table 3), consistent with an expressivity bottleneck framing.
  • What is less fully supported in the excerpt:
    • The mechanistic “off-principal” explanation is plausible within their spectral plots (Figure 3), but depends on broader theoretical claims (they cite Zhu et al., 2025) that are not derived in this excerpt. The paper’s own Figure 3 provides correlational evidence (update spectra + collapse), not a proof.

6. Limitations and Trade-offs

  • Model and domain scope
  • Only two base models are evaluated: DeepSeek-R1-Distill-Qwen 1.5B and 7B (Section 2.3).
  • Tasks are exclusively mathematical reasoning benchmarks (Table 2), so it is unclear how conclusions transfer to non-math domains (coding, instruction following, dialog, etc.).

  • RLVR recipe specificity

  • Default training uses DAPO with ε = 0.28 and no KL term (β absent), plus very long completion length (16384 training; 32768 eval). Different RLVR stabilization choices (e.g., KL-regularization) might change adapter behavior, but are not explored beyond algorithm ablations.

  • Training horizon differences

  • The 1.5B runs use 1024 steps while the 7B runs use 8192 steps (Section 2.4). This makes “scaling” comparisons informative but not perfectly controlled for step budget.

  • Incomplete reporting for full reproducibility (from the provided excerpt)

  • The excerpt does not specify optimizer type, optimizer hyperparameters, or full system hardware/compute budget (GPU type, total FLOPs/PF-days). This constrains how precisely one can replicate or normalize results.

  • Mechanistic claims vs evidence

  • The “spectral misalignment” explanation uses spectral plots (Figure 3) and a qualitative gradient-flow argument (Eq. (3)), but the excerpt does not include tests of alternative hypotheses (e.g., numerically fixing the MiLoRA init magnitude to see whether the collapse disappears), so the causal story is suggestive rather than definitive.

7. Implications and Future Directions

  • How this changes practice
  • For RLVR on math reasoning in this setup, standard LoRA is not the best default: structural methods like DoRA (and also AdaLoRA, MiSS) provide a better accuracy–parameter trade-off (Table 3; Figure 1).
  • LoRA+ appears robust and scales well to 7B (Table 6), suggesting that optimization-dynamics tweaks (learning rate ratios) can matter as much as structural changes.

  • What follow-up research it enables

  • The paper’s future work section proposes:

    • Migrating to higher-performance RL infrastructure (from TRL to something like “VeRL”) to scale experiments further (Section 4).
    • Deeper mechanistic interpretability of adapter dynamics, especially the spectral evolution arguments behind why DoRA works and why SVD-based init collapses (Section 4).
    • Testing broader environments: multimodal, multi-turn, asynchronous RL, and deployment issues like weight merging stability (Section 4).
  • Practical applications / downstream use

  • Efficient RLVR training for reasoning-centric assistants where correctness is verifiable (e.g., math solvers) benefits from picking adapters that preserve enough “plasticity” while staying parameter-efficient.

  • Repro/Integration Guidance (when to prefer what, based on this paper’s evidence)

  • If you are doing RLVR with verifiable binary rewards on reasoning tasks similar to these benchmarks:
    • Prefer DoRA as a strong default when you want maximum accuracy among PEFTs; on 1.5B it even exceeds full fine-tuning in overall average (46.6 vs 44.9, Table 3).
    • Consider AdaLoRA or MiSS if you want structural variants that beat LoRA but may have different parameter fractions (MiSS is 0.99% trainable in Table 4).
    • Consider LoRA+ as an optimization-stable variant; it is competitive at 1.5B (43.9, Table 3) and best among the listed methods at 7B (55.5, Table 6).
    • Avoid SVD-based init methods like PiSSA and MiLoRA under this RLVR recipe due to severe collapse (0.2 and 18.0, Table 3; Figure 3).
    • Avoid extreme compression (e.g., IA3 at 22.3, Table 3) if your goal is reasoning improvement; the paper’s data suggests you need a minimum adapter capacity (“expressivity floor,” Table 4).
  • If you must reduce parameters aggressively for hardware reasons, the results suggest moderate reductions like LoRA-FA can remain competitive (43.0, Table 3) compared to methods that only train scaling vectors.