JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
ArXiv: 2512.16649
🎯 Pitch
JustRL demonstrates that a minimal, single-stage RL recipe with fixed hyperparameters can scale two 1.5B reasoning LLMs to state-of-the-art math performance (54.9% and 64.3% avg across nine benchmarks) while using roughly half the compute of more complex pipelines. By showing smooth, monotonic training without multi-stage schedules, dynamic tricks, or length penalties—and by reporting ablations where common ‘stabilization’ tricks hurt—this work argues for simpler, more reproducible baselines that unlock further gains beyond distillation limits.
1. Executive Summary (2-3 sentences)
JustRL shows that a very simple, single-stage reinforcement learning setup with fixed hyperparameters can scale two 1.5B “reasoning” language models to strong math-benchmark performance while using substantially less compute than many recent multi-trick pipelines. Concretely, it reports 54.87% average accuracy (nine math benchmarks) for JustRL-DeepSeek-1.5B and 64.32% for JustRL-Nemotron-1.5B, with smooth training curves over 3,000–4,000+ steps and no need for stage schedules or mid-training interventions (Abstract; §4; Tables 3–6; Figures 1–3).
2. Context and Motivation
- Problem / gap.
- Recent RL-for-reasoning work on smaller language models (SLMs) often accumulates many stabilization “tricks” (multi-stage training, dynamic schedules, curriculum learning, length penalties, dynamic sampling, reference resets, etc.), making it unclear which components are truly necessary or beneficial (§1–§2; Table 1).
- The paper asks: is this complexity necessary for stable, high-performing RL training at ~1.5B parameters? (Abstract; §1).
- Why it matters.
- In practice, industry often favors distillation (supervised fine-tuning on a stronger teacher’s outputs) for small models because it is efficient and stable (§1).
- However, distillation is described here as bounded by the teacher’s capability; once the teacher plateaus, the student may plateau too, and RL can be a path to continued improvement (§1).
- Prior approaches and their shortcomings (as positioned here).
- Many recent methods pursue stability/performance by stacking techniques:
- Progressive context length schedules (e.g., 8K → 16K → 24K) and multiple stages (e.g., DeepScaleR, FastCuRL, ProRL) (§2).
- Length penalties, adaptive temperature, dynamic sampling/filtering, “rollout rescue,” reset KL references, etc. (Table 1; §2).
- The paper’s critique is methodological: once baselines are already complex, new techniques may be compensating for instabilities introduced by earlier complexity, rather than addressing fundamental RL difficulties (§1).
- How this paper positions itself.
- It proposes a minimal RL baseline—single-stage, fixed hyperparameters, standard dataset, simple prompt—and evaluates it on two different 1.5B backbones to argue that stability and competitiveness can emerge from “stable fundamentals at adequate scale” (§3–§4).
- It emphasizes not claiming optimality, but establishing a simple, validated baseline and showing some “standard tricks” can backfire in this setting (Abstract; §4.4; §5; Limitations).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a post-training reinforcement learning pipeline that takes an already-distilled 1.5B language model and further improves it using verifiable rewards on math problems.
- It does so by running single-stage RL training (no curricula or stage transitions) with a fixed set of hyperparameters and a simple verifier-based reward (§3.1; Table 2).
3.2 Big-picture architecture (diagram in words)
- Inputs: math problems from DAPO-Math-17k and a base model (DeepSeek-R1-Distill-Qwen-1.5B or OpenMath-Nemotron-1.5B) (§3.1).
- Rollout generation: the model samples multiple candidate solutions per problem (Rollout N = 8) (§3.1; Table 2).
- Verification + reward: a lightweight rule-based verifier produces a binary outcome reward (correct/incorrect) (§3.1).
- Policy update: GRPO (in veRL) updates model weights using clipped policy-gradient-style updates, with fixed hyperparameters throughout training (§3.1; Table 2).
- Evaluation: performance is measured on nine math benchmarks using a pass@1-style accuracy aggregated over multiple sampled responses (@32 for most tasks; @4 for three) with a specified decoding setup (§3.2; Tables 3 and 5).
3.3 Roadmap for the deep dive
- I will explain:
- The RL setting used here (verifiable, binary rewards; what “rollouts” mean).
- The specific algorithmic choice (GRPO in veRL) at the level described in the provided text (since equations are not included here).
- The training “minimalism” constraints (single-stage, fixed hyperparameters, data, prompt, length handling).
- The evaluation protocol and how scores like avg@32 are computed.
- The stability signals the paper tracks (entropy, reward, response length) and what they indicate.
3.4 Detailed, sentence-based technical breakdown
- Framing. This is an empirical training-recipe / system paper: the core idea is that a carefully chosen but simple RL setup (single-stage + fixed hyperparameters + lightweight verifier reward) can scale to strong reasoning performance without the layered complexity common in recent SLM RL pipelines (Abstract; §3–§4).
- RL problem setup (what is optimized).
- The model is trained on math question-answer data (DAPO-Math-17k) by generating candidate responses and receiving a reward determined by whether the answer is verified as correct (§3.1).
- The reward is explicitly described as a binary outcome reward, meaning each rollout is scored as success/failure rather than receiving a graded partial-credit score (§3.1).
- System/data pipeline diagram in words (explicit sequence).
- Sample a batch of problems from DAPO-Math-17k (“Standard data” with no offline difficulty filtering or online dynamic sampling in this recipe) (§3.1).
- For each problem, prompt the model with a fixed, simple suffix:
> “Please reason step by step, and put your final answer within \boxed{}.” (§3.1)
- For each problem, generate multiple candidate solutions (the paper sets Rollout N = 8) at training time (§3.1; Table 2).
- For each generated solution, apply a lightweight rule-based verifier (from DAPO) to check correctness and produce a binary reward; the verifier explicitly avoids symbolic math libraries like SymPy to reduce overhead (§3.1).
- Use GRPO (implemented in veRL) to compute advantages / update the policy from these sampled rollouts and rewards (§3.1).
- Repeat this process continuously for thousands of steps with no stage transitions or schedule changes, producing stable training curves on tracked signals (Figures 1–2; §4.3).
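The prompt and verification steps in the sequence above can be sketched as follows. This is a minimal illustration, not the DAPO verifier itself: the function names, the prompt layout, and the normalization rules are assumptions made for the example. It only shows the general shape of a rule-based check that reads the last \boxed{} answer and returns a binary reward without any symbolic math library.

```python
import re
from typing import Optional

# Prompt suffix quoted in the text above.
PROMPT_SUFFIX = r"Please reason step by step, and put your final answer within \boxed{}."


def build_prompt(problem: str) -> str:
    """Append the fixed suffix to a problem statement (layout is an assumption)."""
    return f"{problem}\n{PROMPT_SUFFIX}"


def extract_last_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = r"\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as "no answer"


def normalize(answer: str) -> str:
    """Hypothetical normalization: drop whitespace and '$', strip trailing periods."""
    return re.sub(r"[\s$]", "", answer).rstrip(".")


def binary_reward(response: str, ground_truth: str) -> int:
    """1 if the last boxed answer matches the reference after normalization, else 0."""
    predicted = extract_last_boxed(response)
    if predicted is None:
        return 0
    return int(normalize(predicted) == normalize(ground_truth))
```

A string-matching check like this is cheap but can produce false negatives (for example, equivalent fractions written differently), which is the kind of failure the evaluation-time model verifier mentioned later is meant to reduce.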
- Core training constraints (“what we keep simple”).
- Single-stage training: there is no progressive context-length increase, no curriculum switching, and no “split stages” (§3.1).
- Fixed hyperparameters: no adaptive temperature schedule, no dynamic batch size, and no mid-training KL reference resets (§3.1).
- No explicit length penalty term: they set a maximum context length (16K) rather than adding a penalty to the reward/objective to discourage long outputs (§3.1).
- Standard dataset usage: they train directly on DAPO-Math-17k without difficulty filtering or dynamic sampling strategies (§3.1).
- The one technique they do keep (stability).
- They state they use “clip higher” as a stability practice for long-horizon RL training and treat it as baseline rather than an extra trick (§3.1).
- The paper later links this to stable entropy behavior (Figure 2(a); §4.3), implying it helps avoid “exploration collapse” or other entropy pathologies.
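To make “clip higher” concrete, here is a minimal sketch of a PPO-style clipped surrogate with an asymmetric clip range. The default bounds are the Clip Ratio Range [0.8, 1.28] from Table 2. The function name and the per-token scalar form are assumptions for illustration, and the summarized text does not give the paper's exact objective.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      clip_low: float = 0.8, clip_high: float = 1.28) -> float:
    """Per-token clipped policy-gradient objective (to be maximized).

    ratio     : pi_new(token) / pi_old(token)
    advantage : advantage assigned to the rollout containing this token
    """
    clipped_ratio = min(max(ratio, clip_low), clip_high)
    # Pessimistic choice between the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, the upper bound decides how far a token's
# probability may be pushed up before its contribution stops growing.
print(clipped_surrogate(1.25, 1.0))            # asymmetric range: 1.25
print(clipped_surrogate(1.25, 1.0, 0.8, 1.2))  # symmetric 0.2 range: 1.2
```

The wider upper bound leaves more room to raise the probability of tokens in rewarded rollouts, which is one plausible reading of why the summary associates this setting with entropy that does not collapse.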
- Hyperparameters and configuration (as provided). Table 2 lists the fixed configuration used for both models:
  - Advantage Estimator: GRPO
  - Use KL Loss: No
  - Use Entropy Regularization: No
  - Train Batch Size: 256
  - Max Prompt Length: 1k
  - Max Response Length: 15k
  - PPO Mini Batch Size: 64
  - PPO Micro Batch Size/GPU: 1
  - Clip Ratio Range: [0.8, 1.28]
  - Learning Rate: 1e-6 (constant)
  - Temperature: 1.0 (training)
  - Rollout N: 8
  - Reward Function: DAPO verifier
- Note: The provided excerpt does not include model architecture details (layers, hidden size, heads), tokenizer, optimizer name beyond what veRL uses internally, or learning-rate scheduler beyond “constant,” so those cannot be specified here.
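For reference, the Table 2 values can be collected into one immutable object, with two quantities derived from them. This is a sketch only: the field names are hypothetical and are not veRL configuration keys, lengths are kept in the “k” units used by the table, and the rollouts-per-step figure assumes that Train Batch Size counts prompts rather than responses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JustRLConfig:
    """Fixed training configuration as listed in Table 2 (field names are illustrative)."""
    advantage_estimator: str = "grpo"
    use_kl_loss: bool = False
    use_entropy_regularization: bool = False
    train_batch_size: int = 256
    max_prompt_length_k: int = 1      # "1k" in Table 2
    max_response_length_k: int = 15   # "15k" in Table 2
    ppo_mini_batch_size: int = 64
    ppo_micro_batch_size_per_gpu: int = 1
    clip_ratio_low: float = 0.8
    clip_ratio_high: float = 1.28
    learning_rate: float = 1e-6       # constant, no schedule
    temperature: float = 1.0          # training-time sampling temperature
    rollout_n: int = 8
    reward_function: str = "dapo_verifier"


cfg = JustRLConfig()

# Assumption: the batch size counts prompts, so each step samples 256 * 8 responses.
rollouts_per_step = cfg.train_batch_size * cfg.rollout_n

# Prompt plus response budget, matching the 16K context cap mentioned in the text.
context_budget_k = cfg.max_prompt_length_k + cfg.max_response_length_k

print(rollouts_per_step, context_budget_k)
```

Freezing the dataclass mirrors the recipe's premise that nothing in this configuration changes during training.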
- Compute / hardware (as provided).
- Training runs use 32 × A800-80GB GPUs for approximately 15 days for each of the two backbones (§3.1).
- Step counts reported: 4,380 steps for JustRL-DeepSeek-1.5B (§4.1) and 3,440 steps for JustRL-Nemotron-1.5B (§4.2).
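A quick back-of-the-envelope calculation puts these figures on a common scale. It treats “approximately 15 days” as exactly 15 days and attributes all wall-clock time to training steps, which is an assumption, since time spent on evaluation or restarts is not reported here.

```python
NUM_GPUS = 32
TRAINING_DAYS = 15  # reported as approximate; taken as exact for this estimate
STEPS = {
    "JustRL-DeepSeek-1.5B": 4380,
    "JustRL-Nemotron-1.5B": 3440,
}

gpu_days = NUM_GPUS * TRAINING_DAYS
gpu_hours = gpu_days * 24

# Average wall-clock minutes per training step, per backbone.
minutes_per_step = {
    name: TRAINING_DAYS * 24 * 60 / steps for name, steps in STEPS.items()
}

print(f"{gpu_days} GPU-days, {gpu_hours} GPU-hours per backbone")
for name, minutes in minutes_per_step.items():
    print(f"{name}: about {minutes:.1f} min/step")
```

Under these assumptions each run costs about 480 GPU-days, and a single step (rollout generation plus policy update) averages roughly 5 to 6 minutes.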
- Evaluation protocol (how metrics are computed).
- They evaluate on nine math benchmarks using scripts from POLARIS (§3.2).
- They report Pass@1 accuracy, computed by sampling N responses per problem and averaging their correctness (“avg@N” style):
  - N = 4 for MATH-500, Minerva Math, and OlympiadBench.
  - N = 32 for all other benchmarks (§3.2; see also captions of Tables 3 and 5).
- Generation/eval decoding uses temperature = 0.7, top-p = 0.9, and allows up to 32K tokens at evaluation time (§3.2).
- They also mention augmenting evaluation systems with CompassVerifier-3B to address false negatives from rule-based verifiers (§3.2). The excerpt does not fully spell out whether this verifier is used for training rewards, evaluation scoring, or both; the reward function for training is explicitly “DAPO” (Table 2), while CompassVerifier-3B is described inside the evaluation protocol (§3.2). Assumption (minimal): CompassVerifier-3B is used to improve evaluation reliability rather than to replace the training reward, because Table 2 ties training reward to DAPO.
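The “avg@N” aggregation can be written in a few lines. This is a sketch under the reading given above (per-problem mean correctness over N samples, then a mean over problems); the function name and input layout are assumptions, not the POLARIS scripts.

```python
from statistics import mean
from typing import Sequence


def avg_at_n(correct_per_problem: Sequence[Sequence[int]]) -> float:
    """Benchmark score in percent.

    correct_per_problem[i][j] is 1 if sample j for problem i was judged
    correct and 0 otherwise; every problem should have the same N samples.
    """
    per_problem_accuracy = [mean(samples) for samples in correct_per_problem]
    return 100.0 * mean(per_problem_accuracy)


# Two problems, N = 4 samples each: 2/4 and 4/4 correct.
print(avg_at_n([[1, 1, 0, 0], [1, 1, 1, 1]]))  # 75.0
```

Unlike a single greedy decode, this estimate averages out sampling noise, which matters on small benchmarks such as AIME where each problem carries several percentage points.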
- Concrete micro-walkthrough (illustrative, consistent with the paper’s setup).
- Suppose a batch contains a single math problem.
- The model generates 8 candidate step-by-step solutions (rollouts) under the fixed prompt suffix.
- The verifier checks each rollout’s final boxed answer and assigns reward 1 if correct and 0 otherwise (binary outcome).
- GRPO uses these rollouts and rewards to update the model so that solution styles that correlate with verified correctness become more likely in future sampling, subject to the clipping constraints (Table 2; §3.1).
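The walkthrough can be made numeric with the group-relative advantage commonly used to describe GRPO: each rollout's reward is centered and scaled by the statistics of its own group of 8. The summarized text gives no equations, so this formulation is an assumption, as are the use of the population standard deviation and the small epsilon; implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One problem, 8 rollouts, 3 verified correct.
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# If every rollout gets the same reward, all advantages are zero and the
# problem contributes no learning signal for that step.
print(group_relative_advantages([0] * 8))
```

In this example the three correct rollouts receive an advantage of about +1.29 and the five incorrect ones about -0.77, so the update raises the likelihood of the correct solutions relative to the rest of the group.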
4. Key Insights and Innovations
- 1) “Simplicity at scale” baseline that matches/edges complex pipelines.
- The central contribution is showing that single-stage RL with fixed hyperparameters can be competitive with (or slightly better than) multi-stage, dynamically scheduled recipes on two different 1.5B backbones (Abstract; §4.1–§4.2; Tables 3 and 5).
- The novelty is not a new RL algorithm, but a validated minimal recipe intended to clarify what is “fundamentally sufficient” before adding complexity (§1; §5).
- 2) Cross-backbone hyperparameter transfer without tuning.
- The exact same hyperparameters (Table 2) reportedly work for both DeepSeek-R1-Distill-Qwen-1.5B and OpenMath-Nemotron-1.5B without per-model tuning (§3.1; §4.2).
- This is significant because many prior works emphasize tuning/schedules; here, robustness is argued via transfer across two starting points.
- 3) Stable training dynamics without standard interventions.
- The paper highlights smooth, monotonic improvement over 4,000+ steps (Figure 1(a); §4.1) and stable internal signals: entropy oscillating without collapse, reward rising, and response length compressing naturally (Figure 2; §4.3).
- This matters because much recent complexity is motivated by instability modes like reward collapse, entropy drift, and length explosion (§1; §4.3).
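Since entropy is the main stability signal discussed here, a short sketch of how such a signal is typically computed may help: the Shannon entropy of the next-token distribution, averaged over generated tokens. The use of natural-log units and a plain token average are assumptions; the summarized text does not state how the paper's entropy curves are computed.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy in nats of softmax(logits) for one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(per_token_logits: List[Sequence[float]]) -> float:
    """Average entropy over all token positions in a batch of rollouts."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)


# A uniform distribution over 4 tokens has entropy ln(4); a sharply peaked one is near 0.
print(round(token_entropy([0.0, 0.0, 0.0, 0.0]), 3))   # 1.386
print(round(token_entropy([10.0, 0.0, 0.0, 0.0]), 3))
```

A falling average means the policy is concentrating probability on fewer continuations, which is why a sustained drop is read as reduced exploration.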
- 4) Negative ablations: “standard tricks” can degrade performance by collapsing exploration.
- Adding an explicit overlong penalty and/or a more robust verifier degrades AIME24 performance and reduces entropy to ~0.5–0.6, which the paper interprets as collapsed exploration (Figure 3; §4.4).
- The takeaway is that techniques are not universally beneficial; they can interact destructively with an otherwise stable baseline.
5. Experimental Analysis
- Evaluation methodology.
- Benchmarks (9): AIME 2024, AIME 2025, AMC 2023, MATH-500, Minerva Math, OlympiadBench, HMMT Feb 2025, CMIMC 2025, BRUMO 2025 (§3.2).
- Metric: “Pass@1 accuracy” aggregated via multiple samples per problem (@32 for most; @4 for three) (§3.2; Tables 3 and 5).
- Decoding: temp 0.7, top-p 0.9, max 32K tokens (§3.2).
- Baselines compared (as shown in tables):
  - On DeepSeek backbone: DeepScaleR-1.5B, ProRL-V2, and partially BroRL (some benchmarks missing) (Table 3).
  - On Nemotron backbone: QuestA (Table 5).
- Compute comparison: Token-budget approximations and training settings are summarized in Tables 4 and 6 (with notes about dynamic sampling assumptions).
- Main quantitative results (with specific numbers).
- DeepSeek backbone (Table 3; §4.1):
  - Backbone average: 37.65.
  - ProRL-V2 average: 53.08.
  - JustRL-DeepSeek average: 54.87.
  - Example per-benchmark: AIME24 52.60, AIME25 38.75, AMC23 91.02, MATH 91.65, BRUMO 52.71 (Table 3).
- Nemotron backbone (Table 5; §4.2):
  - Backbone average: 56.74.
  - QuestA average: 63.81.
  - JustRL-Nemotron average: 64.32.
  - Example per-benchmark: AMC23 96.02, MATH 94.15, Olympiad 76.59, CMIMC 41.72 (Table 5).
- Training curves (Figure 1):
  - Figure 1(a) highlights AIME24 (avg@32) improving from roughly 28% (base) to about 58% after 4,000+ steps for JustRL-DeepSeek-1.5B.
  - Figure 1(b) shows JustRL-Nemotron-1.5B reaching 70+% (AIME24 avg@32) over 3,000+ steps.
- Compute / efficiency claims and supporting evidence.
- DeepSeek backbone: Table 4 reports JustRL-DeepSeek token budget (approx.) 1.4×10^8k vs ProRL-V2 2.8×10^8k, and BroRL 6.8×10^8k. (The table uses a k suffix; the excerpt does not define it explicitly, but it strongly suggests “thousands of tokens.”)
- Nemotron backbone: Table 6 reports JustRL-Nemotron token budget (approx.) 1.1×10^8k vs QuestA 2.6×10^8k, described as 2.4× less compute in the caption.
- Both compute tables include a note about dynamic sampling methods and an estimated 50% filter ratio following POLARIS, which affects token-budget estimates (Tables 4 and 6; §4.1 “Note on dynamic sampling”).
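The compute ratios implied by these token budgets can be checked directly. The numbers below are the approximate values quoted above, in units of 10^8 k tokens; the helper name is illustrative. Budgets for methods that use dynamic sampling already depend on the assumed 50% filter ratio, so the ratios inherit that uncertainty.

```python
from typing import Dict

# Approximate token budgets in units of 10^8 k tokens (Tables 4 and 6, as quoted).
TOKEN_BUDGETS: Dict[str, Dict[str, float]] = {
    "DeepSeek": {"JustRL-DeepSeek": 1.4, "ProRL-V2": 2.8, "BroRL": 6.8},
    "Nemotron": {"JustRL-Nemotron": 1.1, "QuestA": 2.6},
}


def budget_ratios(budgets: Dict[str, float], reference: str) -> Dict[str, float]:
    """Each method's budget divided by the reference method's budget."""
    ref = budgets[reference]
    return {name: value / ref for name, value in budgets.items() if name != reference}


deepseek_ratios = budget_ratios(TOKEN_BUDGETS["DeepSeek"], "JustRL-DeepSeek")
nemotron_ratios = budget_ratios(TOKEN_BUDGETS["Nemotron"], "JustRL-Nemotron")

for name, ratio in {**deepseek_ratios, **nemotron_ratios}.items():
    print(f"{name}: {ratio:.1f}x the JustRL budget")
```

This gives roughly 2.0× for ProRL-V2, 4.9× for BroRL, and 2.4× for QuestA, consistent with the “roughly half the compute” framing and with the 2.4× figure in the Table 6 caption.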
- Do experiments support the claims?
- Competitiveness with simplicity: Yes, within the shown comparisons: JustRL slightly exceeds ProRL-V2 on the DeepSeek backbone average (54.87 vs 53.08; Table 3) and slightly exceeds QuestA on the Nemotron backbone average (64.32 vs 63.81; Table 5).
- Stability claim: The paper provides internal training dynamics for JustRL-DeepSeek-1.5B (Figure 2; §4.3) and qualitative “smooth curve” evidence in Figure 1 for both backbones.
- Transfer without tuning: Supported by the “same hyperparameters” statement (§3.1; §4.2) plus two-backbone results.
- Caveat: The excerpt itself notes they “don’t have the computational resources to run extensive controlled comparisons” for some stability contrasts (§4.3), so stability superiority vs complex methods is argued more by absence of pathologies in their runs than by head-to-head controlled ablations against each trick.
- Ablations / robustness checks.
- Two ablations on JustRL-DeepSeek-1.5B (trained 3,000+ steps) (Figure 3; §4.4):
  - Adding overlong penalty (explicit length penalty for last 4k tokens) reduces AIME24 plateau to ~50% vs ~55% baseline.
  - Adding overlong penalty + robust verifier reduces plateau further to ~45%.
- The entropy plot (Figure 3(b)) shows entropy collapsing to ~0.5–0.6 under these modifications, versus ~1.2–1.4 oscillation in the base recipe (§4.4).
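To illustrate what an overlong penalty “for the last 4k tokens” could look like, here is a hedged sketch. The text gives only the 4k window, so the linear ramp, the -1 floor, and the reading of “15k” as 15,000 tokens are all assumptions made for this example rather than the ablation's actual definition.

```python
def overlong_penalty(response_len: int,
                     max_len: int = 15_000,
                     buffer_len: int = 4_000) -> float:
    """Length penalty added to the outcome reward (0 means no penalty).

    Responses shorter than max_len - buffer_len are untouched; inside the
    final buffer the penalty ramps linearly from 0 down to -1.
    """
    soft_limit = max_len - buffer_len
    if response_len <= soft_limit:
        return 0.0
    if response_len <= max_len:
        return (soft_limit - response_len) / buffer_len
    return -1.0


for length in (8_000, 11_000, 13_000, 15_000):
    print(length, overlong_penalty(length))
```

Under a scheme like this, a correct but long solution can end up with a lower total reward than a shorter one, which is one way such a term could push the policy toward a narrower set of behaviors, in line with the entropy drop reported for the ablation.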
6. Limitations and Trade-offs
- Scope limitations (explicit).
- Results are limited to mathematical reasoning tasks and two backbones at ~1.5B parameters; generalization to other domains (coding, general QA) and other scales is not explored (Limitations; §5).
- The paper cannot definitively isolate which components are most critical (hyperparameters vs verifier vs training data vs interactions) (Limitations; §5).
- Experimental/control limitations.
- The paper highlights that it cannot run extensive controlled comparisons to isolate which complex techniques cause or solve instability (§4.3).
- Negative ablations are limited to two modifications; many other interventions (curriculum learning, adaptive temperature, reference resets, alternative verifiers, data augmentation) are not explored (§4.4).
- Compute accessibility trade-off.
- Even though compute is lower than some baselines, the reported training uses 32 × A800-80GB for ~15 days, which may still be prohibitive for many researchers (Limitations; §3.1).
- Potential trade-off implied by ablations: exploration vs constraint.
- The ablations suggest that explicit length penalties and a more “robust” verifier can unintentionally reduce exploration (entropy collapse) and harm final performance (Figure 3; §4.4). This implies a delicate balance: constraints intended to stabilize or shape behavior may prematurely narrow the policy.
7. Implications and Future Directions
- How this changes the field’s workflow (as suggested here).
- The paper argues for a methodological shift: establish strong, simple baselines first, then add complexity only when it demonstrably solves a specific observed problem (§5–§6).
- It suggests that some instability issues motivating complex pipelines may not be fundamental, but artifacts of certain design choices layered on top of each other (§5).
- What follow-up research it enables/suggests.
- Component attribution: controlled studies to disentangle which factor(s) are responsible for stability and gains (hyperparameters, verifier strictness, data distribution, rollout count) (Limitations; §5).
- Generality tests: apply the same fixed recipe to other domains (coding) and other model sizes to see where simplicity breaks down (Limitations).
- Longer-horizon scaling: test whether stability persists beyond the 3,440–4,380 step horizons reported (Limitations).
- Practical applications / downstream use cases.
- For practitioners with a distilled 1–2B math reasoning model that is plateauing, the paper provides a concrete “simple RL” baseline: train on DAPO-Math-17k with GRPO in veRL, binary verifier rewards, Rollout N = 8, a 16K context cap, and the fixed hyperparameters in Table 2 (§3.1–§3.2).
- Repro/Integration Guidance (when to prefer this approach, per the paper’s evidence).
- Prefer this recipe when:
- You want a low-complexity RL post-training baseline without multi-stage curricula, dynamic schedules, or data filtering (§3.1; Table 1).
- You care about stable training dynamics (monotonic improvement, stable entropy, natural response-length convergence) and want to avoid adding constraints that might collapse exploration (Figures 2–3; §4.3–§4.4).
- Be cautious about adding:
- Explicit length penalties or swapping in a “robust verifier” without re-validating, because the provided ablations show these can reduce entropy and hurt performance in this setup (Figure 3; §4.4).