JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

🎯 Pitch

JustRL demonstrates that a minimal, single-stage RL recipe with fixed hyperparameters can scale two 1.5B reasoning LLMs to state-of-the-art math performance (54.9% and 64.3% average across nine benchmarks) while using roughly half the compute of more complex pipelines. Training is smooth and monotonic without multi-stage schedules, dynamic tricks, or length penalties, and the reported ablations show common “stabilization” tricks hurting performance. On that basis the work argues for simpler, more reproducible baselines that unlock further gains beyond distillation limits.

1. Executive Summary

JustRL shows that a very simple, single-stage reinforcement learning setup with fixed hyperparameters can scale two 1.5B “reasoning” language models to strong math-benchmark performance while using substantially less compute than many recent multi-trick pipelines. Concretely, it reports 54.87% average accuracy (nine math benchmarks) for JustRL-DeepSeek-1.5B and 64.32% for JustRL-Nemotron-1.5B, with smooth training curves over 3,000–4,000+ steps and no need for stage schedules or mid-training interventions (Abstract; §4; Tables 3–6; Figures 1–3).

2. Context and Motivation

Problem / gap.
Recent RL-for-reasoning work on smaller language models (SLMs) often accumulates many stabilization “tricks” (multi-stage training, dynamic schedules, curriculum learning, length penalties, dynamic sampling, reference resets, etc.), making it unclear which components are truly necessary or beneficial (§1–§2; Table 1).
The paper asks: is this complexity necessary for stable, high-performing RL training at ~1.5B parameters? (Abstract; §1).
Why it matters.
In practice, industry often favors distillation (supervised fine-tuning on a stronger teacher’s outputs) for small models because it is efficient and stable (§1).
However, distillation is described here as bounded by the teacher’s capability; once the teacher plateaus, the student may plateau too, and RL can be a path to continued improvement (§1).
Prior approaches and their shortcomings (as positioned here).
Many recent methods pursue stability/performance by stacking techniques:
- Progressive context length schedules (e.g., 8K → 16K → 24K) and multiple stages (e.g., DeepScaleR, FastCuRL, ProRL) (§2).
- Length penalties, adaptive temperature, dynamic sampling/filtering, “rollout rescue,” reset KL references, etc. (Table 1; §2).
The paper’s critique is methodological: once baselines are already complex, new techniques may be compensating for instabilities introduced by earlier complexity, rather than addressing fundamental RL difficulties (§1).
How this paper positions itself.
It proposes a minimal RL baseline (single-stage, fixed hyperparameters, standard dataset, simple prompt) and evaluates it on two different 1.5B backbones to argue that stability and competitiveness can emerge from “stable fundamentals at adequate scale” (§3–§4).
It does not claim optimality; the aim is to establish a simple, validated baseline and to show that some “standard tricks” can backfire in this setting (Abstract; §4.4; §5; Limitations).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a post-training reinforcement learning pipeline that takes an already-distilled 1.5B language model and further improves it using verifiable rewards on math problems.
It does so by running single-stage RL training (no curricula or stage transitions) with a fixed set of hyperparameters and a simple verifier-based reward (§3.1; Table 2).

3.2 Big-picture architecture (diagram in words)

Inputs: math problems from DAPO-Math-17k and a base model (DeepSeek-R1-Distill-Qwen-1.5B or OpenMath-Nemotron-1.5B) (§3.1).
Rollout generation: the model samples multiple candidate solutions per problem (Rollout N = 8) (§3.1; Table 2).
Verification + reward: a lightweight rule-based verifier produces a binary outcome reward (correct/incorrect) (§3.1).
Policy update: GRPO (in veRL) updates model weights using clipped policy-gradient-style updates, with fixed hyperparameters throughout training (§3.1; Table 2).
Evaluation: performance is measured on nine math benchmarks using a pass@1-style accuracy aggregated over multiple sampled responses (@32 for most tasks; @4 for three) with a specified decoding setup (§3.2; Tables 3 and 5).

3.3 Roadmap for the deep dive

I will explain:
The RL setting used here (verifiable, binary rewards; what “rollouts” mean).
The specific algorithmic choice (GRPO in veRL) at the level described in the provided text (since equations are not included here).
The training “minimalism” constraints (single-stage, fixed hyperparameters, data, prompt, length handling).
The evaluation protocol and how scores like avg@32 are computed.
The stability signals the paper tracks (entropy, reward, response length) and what they indicate.

3.4 Detailed, sentence-based technical breakdown

Framing. This is an empirical training-recipe / system paper: the core idea is that a carefully chosen but simple RL setup (single-stage + fixed hyperparameters + lightweight verifier reward) can scale to strong reasoning performance without the layered complexity common in recent SLM RL pipelines (Abstract; §3–§4).
RL problem setup (what is optimized).
The model is trained on math question-answer data (DAPO-Math-17k) by generating candidate responses and receiving a reward determined by whether the answer is verified as correct (§3.1).
The reward is explicitly described as binary outcome rewards, meaning each rollout is scored as success/failure rather than receiving a graded partial-credit score (§3.1).
System/data pipeline diagram in words (explicit sequence).
Sample a batch of problems from DAPO-Math-17k (“Standard data” with no offline difficulty filtering or online dynamic sampling in this recipe) (§3.1).
For each problem, prompt the model with a fixed, simple suffix:
> “Please reason step by step, and put your final answer within \boxed{}.” (§3.1)
For each problem, generate multiple candidate solutions (the paper sets Rollout N = 8) at training time (§3.1; Table 2).
For each generated solution, apply a lightweight rule-based verifier (from DAPO) to check correctness and produce a binary reward; the verifier explicitly avoids symbolic math libraries like SymPy to reduce overhead (§3.1).
Use GRPO (implemented in veRL) to compute advantages / update the policy from these sampled rollouts and rewards (§3.1).
Repeat this process continuously for thousands of steps with no stage transitions or schedule changes, producing stable training curves on tracked signals (Figures 1–2; §4.3).
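The verification step in the sequence above can be illustrated with a minimal sketch. This is not the DAPO verifier itself: the function names (`extract_last_boxed`, `normalize`, `binary_reward`) and the normalization rules are hypothetical, chosen only to show how a rule-based check can produce a binary reward from the final `\boxed{}` answer using plain string handling and no symbolic math library.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...} in the response, or None.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never balanced


def normalize(answer: str) -> str:
    # Hypothetical, deliberately cheap normalization (assumption, not DAPO's rules).
    return answer.strip().rstrip(".").replace(" ", "").replace("$", "")


def binary_reward(response: str, ground_truth: str) -> float:
    # 1.0 if the final boxed answer matches the reference string, else 0.0.
    predicted = extract_last_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0
```

A check this strict will produce false negatives for answers that are equivalent but written differently, which is the kind of gap the evaluation-side verifier mentioned later is meant to address.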
Core training constraints (“what we keep simple”).
Single-stage training: there is no progressive context-length increase, no curriculum switching, and no “split stages” (§3.1).
Fixed hyperparameters: no adaptive temperature schedule, no dynamic batch size, and no mid-training KL reference resets (§3.1).
No explicit length penalty term: they set a maximum context length (16K) rather than adding a penalty to the reward/objective to discourage long outputs (§3.1).
Standard dataset usage: they train directly on DAPO-Math-17k without difficulty filtering or dynamic sampling strategies (§3.1).
The one technique they do keep (stability).
They state they use “clip higher” as a stability practice for long-horizon RL training and treat it as baseline rather than an extra trick (§3.1).
The paper later links this to stable entropy behavior (Figure 2(a); §4.3), implying it helps avoid “exploration collapse” or other entropy pathologies.
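To make “clip higher” concrete, here is a minimal per-token sketch of an asymmetric clipped surrogate. It assumes the Clip Ratio Range [0.8, 1.28] from Table 2 bounds the importance ratio between the new and old policy, so the upper bound (0.28 above 1) is looser than the lower bound (0.2 below 1). The function name and signature are illustrative and are not veRL's API.

```python
import math


def clipped_token_objective(logp_new: float, logp_old: float, advantage: float,
                            clip_low: float = 0.8, clip_high: float = 1.28) -> float:
    # Importance ratio between the current policy and the rollout policy.
    ratio = math.exp(logp_new - logp_old)
    # Asymmetric clipping: more headroom above 1 than below it.
    clipped_ratio = min(max(ratio, clip_low), clip_high)
    # Pessimistic (PPO-style) choice between unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the objective stops growing once the ratio exceeds 1.28; with a negative advantage it stops changing once the ratio falls below 0.8. The wider upper bound lets low-probability tokens gain probability mass faster, which is one plausible reading of why the practice is associated with healthier entropy.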
Hyperparameters and configuration (as provided). Table 2 lists the fixed configuration used for both models:
- Advantage Estimator: GRPO
- Use KL Loss: No
- Use Entropy Regularization: No
- Train Batch Size: 256
- Max Prompt Length: 1k
- Max Response Length: 15k
- PPO Mini Batch Size: 64
- PPO Micro Batch Size/GPU: 1
- Clip Ratio Range: [0.8, 1.28]
- Learning Rate: 1e-6 (constant)
- Temperature: 1.0 (training)
- Rollout N: 8
- Reward Function: DAPO verifier
Note: The provided excerpt does not include model architecture details (layers, hidden size, heads), the tokenizer, the optimizer (beyond whatever veRL uses by default), or any learning-rate schedule other than “constant,” so those cannot be specified here.
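For readers who want the Table 2 values in one place, the following sketch collects them in a frozen dataclass. The class and field names are hypothetical and do not correspond to veRL configuration keys. Two further assumptions are labeled in the comments: “1k” and “15k” are taken as multiples of 1,024 tokens, and the train batch size is assumed to count prompts rather than responses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JustRLConfig:
    # Values from Table 2; field names are illustrative only.
    advantage_estimator: str = "grpo"
    use_kl_loss: bool = False
    use_entropy_regularization: bool = False
    train_batch_size: int = 256            # assumed to count prompts
    max_prompt_length: int = 1 * 1024      # "1k"; exact token count assumed
    max_response_length: int = 15 * 1024   # "15k"; exact token count assumed
    ppo_mini_batch_size: int = 64
    ppo_micro_batch_size_per_gpu: int = 1
    clip_ratio_low: float = 0.8
    clip_ratio_high: float = 1.28
    learning_rate: float = 1e-6            # constant, no scheduler
    temperature: float = 1.0               # training-time sampling
    rollout_n: int = 8
    reward_function: str = "dapo_verifier"

    @property
    def max_context_length(self) -> int:
        # Prompt plus response budget, matching the 16K context cap.
        return self.max_prompt_length + self.max_response_length

    @property
    def responses_per_step(self) -> int:
        # Rollouts sampled per training step under the prompt-count assumption.
        return self.train_batch_size * self.rollout_n


config = JustRLConfig()
```

Under these assumptions the prompt and response limits add up to the 16K maximum context, and each step samples 2,048 responses.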
Compute / hardware (as provided).
Training runs use 32 × A800-80GB GPUs for approximately 15 days for each of the two backbones (§3.1).
Step counts reported: 4,380 steps for JustRL-DeepSeek-1.5B (§4.1) and 3,440 steps for JustRL-Nemotron-1.5B (§4.2).
Evaluation protocol (how metrics are computed).
They evaluate on nine math benchmarks using scripts from POLARIS (§3.2).
They report Pass@1 accuracy, computed by sampling N responses per problem and averaging their correctness (“avg@N” style):
- N = 4 for MATH-500, Minerva Math, and OlympiadBench.
- N = 32 for all other benchmarks (§3.2; see also captions of Tables 3 and 5).
Generation/eval decoding uses temperature = 0.7, top-p = 0.9, and allows up to 32K tokens at evaluation time (§3.2).
They also mention augmenting evaluation systems with CompassVerifier-3B to address false negatives from rule-based verifiers (§3.2). The excerpt does not fully spell out whether this verifier is used for training rewards, evaluation scoring, or both; the reward function for training is explicitly “DAPO” (Table 2), while CompassVerifier-3B is described inside the evaluation protocol (§3.2). Assumption (minimal): CompassVerifier-3B is used to improve evaluation reliability rather than to replace the training reward, because Table 2 ties training reward to DAPO.
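The avg@N aggregation described above can be sketched as follows. The function name and the input layout (a mapping from problem ID to per-sample correctness flags) are assumptions made for illustration; the POLARIS scripts may organize this differently.

```python
from statistics import mean
from typing import Dict, List


def avg_at_n(correct: Dict[str, List[bool]]) -> float:
    # Per-problem accuracy: fraction of the N sampled responses that are correct.
    per_problem = [
        mean(1.0 if is_correct else 0.0 for is_correct in samples)
        for samples in correct.values()
    ]
    # Benchmark score: mean over problems, reported as a percentage.
    return 100.0 * mean(per_problem)


# Toy example with N = 4 samples per problem.
toy_results = {
    "problem_1": [True, True, False, False],
    "problem_2": [True, True, True, True],
}
score = avg_at_n(toy_results)  # 75.0
```

Averaging over samples estimates the expected single-sample accuracy, which is why the result is still described as Pass@1 even though N responses are drawn.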
Concrete micro-walkthrough (illustrative, consistent with the paper’s setup).
Suppose a batch contains a single math problem.
The model generates 8 candidate step-by-step solutions (rollouts) under the fixed prompt suffix.
The verifier checks each rollout’s final boxed answer and assigns reward 1 if correct and 0 otherwise (binary outcome).
GRPO uses these rollouts and rewards to update the model so that solution styles that correlate with verified correctness become more likely in future sampling, subject to the clipping constraints (Table 2; §3.1).
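The walkthrough can be made numeric with a sketch of group-relative advantages. The excerpt does not give the paper's equations, so this follows the commonly used GRPO formulation as an assumption: each rollout's reward is centered on the group mean and scaled by the group standard deviation. Population standard deviation is used here; implementations differ on that detail.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # Center each reward on the group mean and scale by the group spread.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts for one problem, two of them verified correct.
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

In this example the two correct rollouts receive positive advantages and the six incorrect ones receive smaller negative advantages. If all eight rollouts earn the same reward, every advantage is zero and the problem contributes no gradient signal, which is the situation dynamic-sampling methods try to filter out and which this recipe simply tolerates.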

4. Key Insights and Innovations

1) “Simplicity at scale” baseline that matches/edges complex pipelines.
The central contribution is showing that single-stage RL with fixed hyperparameters can be competitive with (or slightly better than) multi-stage, dynamically scheduled recipes on two different 1.5B backbones (Abstract; §4.1–§4.2; Tables 3 and 5).
The novelty is not a new RL algorithm, but a validated minimal recipe intended to clarify what is “fundamentally sufficient” before adding complexity (§1; §5).
2) Cross-backbone hyperparameter transfer without tuning.
The exact same hyperparameters (Table 2) reportedly work for both DeepSeek-R1-Distill-Qwen-1.5B and OpenMath-Nemotron-1.5B without per-model tuning (§3.1; §4.2).
This is significant because many prior works emphasize tuning/schedules; here, robustness is argued via transfer across two starting points.
3) Stable training dynamics without standard interventions.
The paper highlights smooth, monotonic improvement over 4,000+ steps (Figure 1(a); §4.1) and stable internal signals: entropy oscillating without collapse, reward rising, and response length compressing naturally (Figure 2; §4.3).
This matters because much recent complexity is motivated by instability modes like reward collapse, entropy drift, and length explosion (§1; §4.3).
4) Negative ablations: “standard tricks” can degrade performance by collapsing exploration.
Adding an explicit overlong penalty and/or a more robust verifier degrades AIME24 performance and reduces entropy to ~0.5–0.6, which the paper interprets as collapsed exploration (Figure 3; §4.4).
The takeaway is that techniques are not universally beneficial; they can interact destructively with an otherwise stable baseline.
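Since entropy is the signal used to diagnose collapsed exploration, a minimal sketch of how such a number can be computed may help. The excerpt does not say how the paper measures entropy, so the assumptions here are that it is the Shannon entropy (in nats) of the next-token distribution, averaged over generated positions. Function names are illustrative.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    # Shannon entropy in nats of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(distributions: List[List[float]]) -> float:
    # Average per-token entropy over all generated positions.
    return sum(token_entropy(p) for p in distributions) / len(distributions)


spread_out = [0.25, 0.25, 0.25, 0.25]   # policy still exploring
peaked = [0.97, 0.01, 0.01, 0.01]       # policy nearly deterministic
```

A distribution that spreads mass across several tokens has higher entropy than one concentrated on a single token, so a sustained drop in the averaged value indicates the policy is sampling less diverse continuations.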

5. Experimental Analysis

Evaluation methodology.
Benchmarks (9): AIME 2024, AIME 2025, AMC 2023, MATH-500, Minerva Math, OlympiadBench, HMMT Feb 2025, CMIMC 2025, BRUMO 2025 (§3.2).
Metric: “Pass@1 accuracy” aggregated via multiple samples per problem (@32 for most; @4 for three) (§3.2; Tables 3 and 5).
Decoding: temp 0.7, top-p 0.9, max 32K tokens (§3.2).
Baselines compared (as shown in tables):
- On DeepSeek backbone: DeepScaleR-1.5B, ProRL-V2, and partially BroRL (some benchmarks missing) (Table 3).
- On Nemotron backbone: QuestA (Table 5).
Compute comparison: Token-budget approximations and training settings are summarized in Tables 4 and 6 (with notes about dynamic sampling assumptions).
Main quantitative results (with specific numbers).
DeepSeek backbone (Table 3; §4.1):
- Backbone average: 37.65.
- ProRL-V2 average: 53.08.
- JustRL-DeepSeek average: 54.87.
- Example per-benchmark: AIME24 52.60, AIME25 38.75, AMC23 91.02, MATH 91.65, BRUMO 52.71 (Table 3).
Nemotron backbone (Table 5; §4.2):
- Backbone average: 56.74.
- QuestA average: 63.81.
- JustRL-Nemotron average: 64.32.
- Example per-benchmark: AMC23 96.02, MATH 94.15, Olympiad 76.59, CMIMC 41.72 (Table 5).
Training curves (Figure 1):
- Figure 1(a) highlights AIME24 (avg@32) improving from roughly 28% (base) to about 58% after 4,000+ steps for JustRL-DeepSeek-1.5B.
- Figure 1(b) shows JustRL-Nemotron-1.5B reaching 70+% (AIME24 avg@32) over 3,000+ steps.
Compute / efficiency claims and supporting evidence.
DeepSeek backbone: Table 4 reports JustRL-DeepSeek token budget (approx.) 1.4×10^8k vs ProRL-V2 2.8×10^8k, and BroRL 6.8×10^8k. (The table uses a k suffix; the excerpt does not define it explicitly, but it strongly suggests “thousands of tokens.”)
Nemotron backbone: Table 6 reports JustRL-Nemotron token budget (approx.) 1.1×10^8k vs QuestA 2.6×10^8k, described as 2.4× less compute in the caption.
Both compute tables include a note about dynamic sampling methods and an estimated 50% filter ratio following POLARIS, which affects token-budget estimates (Tables 4 and 6; §4.1 “Note on dynamic sampling”).
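The reported token budgets can be sanity-checked with simple arithmetic. The excerpt does not state how the budgets were derived; the sketch below assumes an upper-bound estimate in which every rollout is charged the full 16K context (steps × batch size × rollouts × 16k), and that “k” means thousands of tokens. Under that assumption the step counts from §4.1 and §4.2 reproduce the magnitudes in Tables 4 and 6.

```python
def token_budget_k(steps: int, batch_size: int = 256, rollout_n: int = 8,
                   context_k: int = 16) -> int:
    # Upper-bound token budget in thousands of tokens (assumed formula).
    return steps * batch_size * rollout_n * context_k


deepseek_budget_k = token_budget_k(4380)   # 143,523,840 k, about 1.4e8 k
nemotron_budget_k = token_budget_k(3440)   # 112,721,920 k, about 1.1e8 k

# Ratios computed from the reported (rounded) table values.
prorl_vs_justrl = 2.8 / 1.4    # about 2.0x
questa_vs_justrl = 2.6 / 1.1   # about 2.4x
```

Because actual responses are often shorter than the cap, this formula overstates the tokens really generated; it is useful only as a consistent accounting convention across methods.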
Do experiments support the claims?
Competitiveness with simplicity: Yes, within the shown comparisons. JustRL slightly exceeds ProRL-V2 on the DeepSeek backbone average (Table 3) and slightly exceeds QuestA on the Nemotron backbone average (Table 5).
Stability claim: The paper provides internal training dynamics for JustRL-DeepSeek-1.5B (Figure 2; §4.3) and qualitative “smooth curve” evidence in Figure 1 for both backbones.
Transfer without tuning: Supported by “same hyperparameters” statement (§3.1; §4.2) plus two-backbone results.
Caveat: The excerpt itself notes they “don’t have the computational resources to run extensive controlled comparisons” for some stability contrasts (§4.3), so stability superiority vs complex methods is argued more by absence of pathologies in their runs than by head-to-head controlled ablations against each trick.
Ablations / robustness checks.
Two ablations on JustRL-DeepSeek-1.5B (trained 3,000+ steps) (Figure 3; §4.4):
- Adding an overlong penalty (an explicit length penalty on the last 4k tokens) lowers the AIME24 plateau to ~50%, versus ~55% for the baseline.
- Adding the overlong penalty plus a more robust verifier lowers the plateau further, to ~45%.
The entropy plot (Figure 3(b)) shows entropy collapsing to ~0.5–0.6 under these modifications, versus ~1.2–1.4 oscillation in the base recipe (§4.4).
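For intuition about what an overlong penalty on the last 4k tokens might look like, here is a sketch modeled on the soft overlong punishment from DAPO. Whether the ablation uses exactly this shape is an assumption; the excerpt only says the penalty applies to the last 4k tokens. The defaults (a 15k response limit from Table 2 and a 4k buffer, both as multiples of 1,024) are likewise assumptions.

```python
def overlong_penalty(response_len: int, max_len: int = 15 * 1024,
                     buffer_len: int = 4 * 1024) -> float:
    # No penalty while the response stays clear of the buffer zone.
    threshold = max_len - buffer_len
    if response_len <= threshold:
        return 0.0
    # Inside the buffer: penalty grows linearly from 0 down to -1.
    if response_len <= max_len:
        return (threshold - response_len) / buffer_len
    # Beyond the limit: full penalty.
    return -1.0
```

The penalty is added to the binary correctness reward, so a long but correct solution earns less than a short correct one. One reading of the ablation result is that this pressure discourages the longer exploratory reasoning traces that the base recipe would otherwise shorten on its own.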

6. Limitations and Trade-offs

Scope limitations (explicit).
Results are limited to mathematical reasoning tasks and two backbones at ~1.5B parameters; generalization to other domains (coding, general QA) and other scales is not explored (Limitations; §5).
The paper cannot definitively isolate which components are most critical (hyperparameters vs verifier vs training data vs interactions) (Limitations; §5).
Experimental/control limitations.
The paper highlights that it cannot run extensive controlled comparisons to isolate which complex techniques cause or solve instability (§4.3).
Negative ablations are limited to two modifications; many other interventions (curriculum learning, adaptive temperature, reference resets, alternative verifiers, data augmentation) are not explored (§4.4).
Compute accessibility trade-off.
Even though compute is lower than some baselines, the reported training uses 32 × A800-80GB for ~15 days, which may still be prohibitive for many researchers (Limitations; §3.1).
Potential trade-off implied by ablations: exploration vs constraint.
The ablations suggest that explicit length penalties and a more “robust” verifier can unintentionally reduce exploration (entropy collapse) and harm final performance (Figure 3; §4.4). This implies a delicate balance: constraints intended to stabilize or shape behavior may prematurely narrow the policy.

7. Implications and Future Directions

How this changes the field’s workflow (as suggested here).
The paper argues for a methodological shift: establish strong, simple baselines first, then add complexity only when it demonstrably solves a specific observed problem (§5–§6).
It suggests that some instability issues motivating complex pipelines may not be fundamental, but artifacts of certain design choices layered on top of each other (§5).
What follow-up research it enables/suggests.
Component attribution: controlled studies to disentangle which factor(s) are responsible for stability and gains (hyperparameters, verifier strictness, data distribution, rollout count) (Limitations; §5).
Generality tests: apply the same fixed recipe to other domains (coding) and other model sizes to see where simplicity breaks down (Limitations).
Longer-horizon scaling: test whether stability persists beyond the 3,000–4,380 step horizons reported (Limitations).
Practical applications / downstream use cases.
For practitioners with a distilled 1–2B math reasoning model that is plateauing, the paper provides a concrete “simple RL” baseline: train on DAPO-Math-17k with GRPO in veRL, binary verifier rewards, Rollout N=8, 16K context cap, and the fixed hyperparameters in Table 2 (§3.1–§3.2).
Repro/Integration Guidance (when to prefer this approach, per the paper’s evidence).
Prefer this recipe when:
- You want a low-complexity RL post-training baseline without multi-stage curricula, dynamic schedules, or data filtering (§3.1; Table 1).
- You care about stable training dynamics (monotonic improvement, stable entropy, natural response-length convergence) and want to avoid adding constraints that might collapse exploration (Figures 2–3; §4.3–§4.4).
Be cautious about adding:
- Explicit length penalties or swapping in a “robust verifier” without re-validating, because the provided ablations show these can reduce entropy and hurt performance in this setup (Figure 3; §4.4).