DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

ArXiv: 2501.12948

🎯 Pitch

This paper introduces DeepSeek-R1-Zero and DeepSeek-R1, demonstrating that large-scale reinforcement learning (including a “pure RL” setup without supervised fine-tuning) can elicit long, self-verifying chain-of-thought reasoning and dramatically boost performance on hard reasoning benchmarks. By combining RL-driven discovery of reasoning behaviors with a cold-start/SFT refinement and distillation into smaller models, the approach offers a practical, lower-cost pathway to substantially stronger reasoning capabilities across math, coding, and STEM tasks.


1. Executive Summary

This paper develops DeepSeek-R1-Zero and DeepSeek-R1, reasoning-focused large language models whose reasoning ability is primarily improved through large-scale reinforcement learning (RL), including a “pure RL” setting without supervised fine-tuning (SFT) for DeepSeek-R1-Zero (§1, §2.2). It shows that rule-based rewards plus an RL algorithm (GRPO) can substantially increase performance on hard reasoning benchmarks (e.g., AIME 2024 pass@1 from 15.6%→71.0% for DeepSeek-R1-Zero) while also revealing practical issues like poor readability and language mixing that motivate a multi-stage pipeline for DeepSeek-R1 (§2.2.4, §2.3). The work further demonstrates that reasoning behaviors can be distilled into smaller dense models via SFT on a curated 800k-sample dataset, often outperforming applying RL directly to smaller base models (Table 5, Table 6, §4.1).

2. Context and Motivation

  • Problem / gap addressed.
  • The paper targets the open problem of effective test-time scaling for reasoning: getting models to improve on math, code, science, and logic by generating longer, more structured reasoning (Chain-of-Thought, CoT) at inference time (§1).
  • It also targets a training pipeline gap: many prior approaches rely heavily on large supervised datasets (SFT) to obtain reasoning behavior, which is expensive to collect (§2.1, §2.2).

  • Why it matters.

  • Reasoning improvements translate to better performance on high-stakes, well-defined tasks (math correctness, code correctness, STEM QA), and the paper frames post-training (RL/SFT after pretraining) as computationally cheaper than pretraining while still impactful (§1).
  • The paper also positions this as a step toward more general intelligent behavior, emphasizing autonomous “self-evolution” of reasoning through incentives (§1, §2.2.4).

  • Prior approaches and shortcomings (as characterized in the paper).

  • Approaches explored in the community include process-based reward models (step-level supervision), RL, and search methods like Monte Carlo Tree Search and beam search (§1; §4.2).
  • The paper asserts that, prior to this work, none of these methods had achieved general reasoning performance comparable to OpenAI’s o1 series (§1). (The paper does not provide a complete historical survey; it gives selected citations and later compares against o1 models in Tables 2 and 4.)

  • How this paper positions itself.

  • It explicitly tests whether reasoning can be incentivized through RL alone, without an SFT “cold start,” by training DeepSeek-R1-Zero from DeepSeek-V3-Base using GRPO and rule-based rewards (§1, §2.2).
  • It then builds a practical model, DeepSeek-R1, by adding a small amount of cold-start CoT SFT and a multi-stage RL/SFT pipeline to improve readability, reduce language mixing, and broaden general capabilities (§2.3).
  • Finally, it argues that distillation from a strong reasoning teacher is a particularly effective way to endow smaller models with reasoning, often beating RL-on-small-models for a given compute budget (§2.4, §4.1; Table 6).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a post-training pipeline that takes a pretrained base language model and makes it better at solving verifiable reasoning problems by optimizing it with reinforcement learning and carefully designed rewards.
  • It obtains stronger reasoning and test-time scaling (longer, more deliberate CoT) through GRPO-based RL with rule-based verification, and then improves usability through cold-start SFT, additional SFT data generation, and a second RL stage for broader alignment (§2.2–§2.3).

3.2 Big-picture architecture (diagram in words)

  • Base model (DeepSeek-V3-Base) → RL stage (reasoning) with GRPO + rule-based rewards → produces DeepSeek-R1-Zero (pure RL path) (§2.2).
  • For DeepSeek-R1: Cold-start SFT (thousands of readable long-CoT examples) → RL stage (reasoning) (rule-based accuracy + language consistency reward) → Rejection sampling to generate a large SFT dataset (reasoning + non-reasoning) → SFT retraining → RL stage (all scenarios) using mixed reward signals (rule-based for reasoning; preference reward models for general prompts) (§2.3.1–§2.3.4).
  • Distillation path: Use DeepSeek-R1-generated/curated 800k samples to SFT smaller base models (Qwen2.5 and Llama families) to produce DeepSeek-R1-Distill-* models (§2.4).

3.3 Roadmap for the deep dive

  • I first explain the RL algorithm (GRPO) because it is the optimization core (§2.2.1, Eq. (1)–(3)).
  • I then explain the reward signals and the training template, because rewards define what behavior is learned (§2.2.2–§2.2.3).
  • Next I walk through the DeepSeek-R1-Zero training outcome and emergent behaviors to ground what RL changes in practice (§2.2.4; Figure 2, Figure 3, Table 3).
  • Then I detail the four-stage DeepSeek-R1 pipeline that addresses usability and generality issues (§2.3.1–§2.3.4).
  • Finally, I cover distillation and the comparison between “RL on small models” vs “distill from a strong model” (§2.4, §4.1; Table 5, Table 6).

3.4 Detailed, sentence-based technical breakdown

This is an empirical post-training algorithm/system paper whose core idea is: optimize a base LLM with large-scale RL using verifiable, mostly rule-based rewards to induce strong reasoning behaviors, then refine with staged SFT and RL for readability and general utility (§2.2–§2.3).

System/data pipeline diagram in words (what happens first, second, third…)

  1. Start from a pretrained base model:
  2. The paper uses DeepSeek-V3-Base as the starting policy model for both DeepSeek-R1-Zero and DeepSeek-R1 (§1, §2.2).
  3. In the evaluation table, DeepSeek-V3 and DeepSeek-R1 are described as MoE models with 671B total parameters and 37B activated parameters (Table 4). The paper excerpt does not describe the internal MoE architecture beyond these counts.

  4. For DeepSeek-R1-Zero, apply RL directly (no SFT cold start):

  5. The training uses GRPO (Group Relative Policy Optimization) to update the model using sampled outputs and their rewards (§2.2.1).
  6. A fixed output template forces the model to emit a reasoning block and then an answer, with reasoning enclosed in <think>...</think> and the final answer in <answer>...</answer> (Table 1, §2.2.3).
  7. Rewards are computed by a rule-based reward system with two components (§2.2.2):
    • Accuracy reward: checks correctness using deterministic verification when possible (e.g., “final answer in a specified format (within a box)” for math; compiler + test cases for LeetCode-style coding).
    • Format reward: checks whether the output follows the required tag structure (<think> / </think>).
  8. The paper explicitly avoids neural reward models (outcome or process reward models) for DeepSeek-R1-Zero, citing risk of reward hacking and extra complexity/resources (§2.2.2).
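The two rule-based reward components described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the exact regular expressions, the 0/1 reward values, the string-equality check for math answers, and the unweighted sum are all assumptions made for the example, and the function names are hypothetical.

```python
import re

# Assumption: a valid response is exactly <think>...</think> followed by <answer>...</answer>.
TEMPLATE = re.compile(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL)
# Assumption: math answers are reported as \boxed{...} without nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(response):
    """Return 1.0 if the response follows the tag structure, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(response) else 0.0


def accuracy_reward(response, reference):
    """Return 1.0 if the last boxed expression in the answer block equals the reference."""
    match = TEMPLATE.fullmatch(response)
    if not match:
        return 0.0
    boxed = BOXED.findall(match.group(2))
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == reference.strip() else 0.0


def rule_based_reward(response, reference):
    # Assumption: the two components are summed with equal weight.
    return accuracy_reward(response, reference) + format_reward(response)


good = r"<think>2 + 2 = 4</think> <answer>\boxed{4}</answer>"
print(rule_based_reward(good, "4"))               # 2.0
print(rule_based_reward("The answer is 4", "4"))  # 0.0
```

For coding tasks the accuracy check would instead compile the program and run it against test cases; that part is omitted here because it depends on a sandbox the summary does not describe.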

  9. Understand GRPO precisely (the RL update rule):

  10. Plain-language paraphrase before notation: for each question, the algorithm samples multiple candidate answers, scores them, and then updates the model to increase the probability of better-than-average samples while keeping the model close to a reference distribution using a KL penalty (§2.2.1).
  11. The objective (Eq. (1)) is a PPO-like clipped policy gradient objective, but with a key twist: instead of learning a separate value-function “critic,” it uses the group’s reward statistics as a baseline, which saves compute (§2.2.1).
  12. Definitions grounded in Eq. (1)–(3) (§2.2.1):
    • Let q be a question sampled from P(Q).
    • Sample a group of G outputs {o_1, …, o_G} from the old policy π_{θ_old}(O|q).
    • For each output o_i, compute a reward r_i.
    • Compute an advantage A_i using group normalization (Eq. (3)):
    • A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G).
    • Intuition: an answer gets positive advantage if it scores above the group average, negative if below, with scaling by variability.
    • Update the policy π_θ by maximizing a clipped importance-weighted term (Eq. (1)) from which a KL penalty term is subtracted:
    • ε is the clipping hyperparameter and β scales the KL penalty in Eq. (1).
    • The KL term is written in Eq. (2) as D_KL(π_θ || π_ref) with a specific estimator form.
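For reference, the three equations can be written out as follows. This is a transcription of the GRPO objective as I understand it from the published formulation, so it is worth checking against Eq. (1)–(3) in the paper before relying on it. The shorthand ρ_i for the importance ratio is introduced here for compactness and is not the paper's notation.

```latex
\begin{aligned}
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  &= \mathbb{E}_{\,q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
     \left[ \frac{1}{G} \sum_{i=1}^{G}
       \Big( \min\!\big( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, A_i \big)
             - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \Big) \right],
  \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}, \\[4pt]
\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
  &= \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)}
     - \log \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1, \\[4pt]
A_i
  &= \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}.
\end{aligned}
```

The KL estimator in the second line is non-negative and equals zero exactly when the two policies assign the same probability to o_i.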
  13. Micro-example (illustrating Eq. (3) without adding paper-unstated numbers):

    • Suppose for a given question you sample G candidate solutions, and two are correct (high reward) while the rest are incorrect (low reward).
    • The correct solutions’ rewards exceed the group mean, so their A_i are positive, and the update increases their probability.
    • Incorrect ones fall below the mean, giving negative A_i, decreasing their probability.
    • This implements “learn from comparisons within the sampled group” without a learned critic.
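The micro-example can be made concrete with a short sketch. The numbers (G = 4, rewards of 1 and 0, ε = 0.2) are illustrative values chosen for the example and do not come from the paper; the use of the population standard deviation and the zero-variance fallback are likewise assumptions, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r_i - mean) / std over one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    if sigma == 0:
        # Every sample scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def clipped_term(ratio, advantage, eps):
    """PPO-style clipped surrogate for one sample: min(ratio * A, clip(ratio) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_theta, logp_ref):
    """Per-sample estimate x - log(x) - 1, where x = pi_ref / pi_theta."""
    log_x = logp_ref - logp_theta
    return math.exp(log_x) - log_x - 1.0


# Illustrative group: two correct (reward 1) and two incorrect (reward 0) samples.
rewards = [1.0, 1.0, 0.0, 0.0]
print(group_advantages(rewards))  # [1.0, 1.0, -1.0, -1.0]
```

Note how clipping is asymmetric in effect: with a positive advantage the gain from raising the ratio is capped at 1 + ε, while with a negative advantage the unclipped (more pessimistic) value is kept.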
  14. Observe emergent behavior and performance for DeepSeek-R1-Zero:

  15. Performance improves steadily during RL on AIME 2024 (Figure 2, §2.2.4):
    • The paper reports pass@1 increasing from 15.6% to 71.0% after “thousands of RL steps” (§1, §2.2.4).
    • It also reports a majority-vote (self-consistency) score of 86.7% on AIME 2024 (Table 2, §2.2.4).
  16. The model’s average response length increases over training (Figure 3, §2.2.4), which the paper interprets as the model learning to spend more “thinking time” (more reasoning tokens) to solve harder problems.
  17. The paper highlights emergent behaviors such as reflection and exploring alternative solution paths, as well as an “aha moment” example where the model explicitly pauses and re-evaluates an earlier step (Table 3, §2.2.4).
  18. Practical issues remain: poor readability and language mixing are documented drawbacks of DeepSeek-R1-Zero (§2.2.4), motivating a more engineered pipeline.

  19. For DeepSeek-R1, add a four-stage pipeline to improve usability and breadth (§2.3):

  20. Stage 1: Cold-start SFT (thousands of examples) (§2.3.1).
    • The goal is to avoid the “early unstable cold start phase of RL training from the base model” and to seed more readable reasoning.
    • The paper describes multiple data-collection routes (few-shot prompting with long CoT, prompting for reflection/verification, reformatting R1-Zero outputs, and human post-processing), and states that it ultimately collects “thousands” of cold-start samples (§2.3.1).
    • It defines a more readable output pattern: |special_token|<reasoning_process>|special_token|<summary> where the summary condenses the reasoning results (§2.3.1).
  21. Stage 2: Reasoning-oriented RL (like R1-Zero) (§2.3.2).
    • It reuses the same large-scale RL process, focusing on reasoning-intensive tasks with clear solutions (coding, math, science, logic) (§2.3.2).
    • It adds a language consistency reward to reduce language mixing, computed as the proportion of target-language words in the CoT (§2.3.2).
    • The final reward is the sum of reasoning accuracy reward and language consistency reward (§2.3.2).
    • The paper notes an ablation finding qualitatively: alignment for language consistency slightly degrades performance but improves readability (§2.3.2). (The excerpt does not provide the ablation table/values.)
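A rough sketch of the Stage 2 reward follows. The summary only states that the language consistency reward is the proportion of target-language words in the CoT and that it is added to the accuracy reward; the tokenization rule (runs of ASCII letters as English words, single CJK characters as Chinese words) and the function names are assumptions made so the example runs.

```python
import re

# Assumption: a "word" is a run of ASCII letters or a single CJK ideograph.
WORD = re.compile("[A-Za-z]+|[\u4e00-\u9fff]")


def language_consistency(cot, target="en"):
    """Fraction of words in the chain of thought that are in the target language."""
    tokens = WORD.findall(cot)
    if not tokens:
        return 0.0

    def in_target(token):
        is_english = token.isascii()
        return is_english if target == "en" else not is_english

    return sum(in_target(t) for t in tokens) / len(tokens)


def stage2_reward(accuracy, cot, target="en"):
    # The final reward is the direct sum of the two terms (§2.3.2).
    return accuracy + language_consistency(cot, target)


print(language_consistency("the answer 是 four"))  # 0.75
print(stage2_reward(1.0, "the answer 是 four"))    # 1.75
```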
  22. Stage 3: Rejection sampling + SFT retraining (§2.3.3).
    • Rejection sampling here means: sample multiple model outputs per prompt and keep only those meeting acceptance criteria (e.g., correct answers, readable format).
    • Reasoning SFT data:
    • Generated from the converged RL checkpoint by sampling multiple trajectories and retaining correct ones (§2.3.3).
    • Expanded beyond purely rule-verifiable tasks by including some data judged by a “generative reward model” using DeepSeek-V3 as judge given ground truth and prediction (§2.3.3).
    • Filtering removes mixed-language CoT, long paragraphs, and code blocks because outputs can be chaotic (§2.3.3).
    • Total reasoning samples: ~600k (§2.3.3).
    • Non-reasoning SFT data:
    • Reuses parts of DeepSeek-V3 SFT dataset for writing, factual QA, self-cognition, translation, etc. (§2.3.3).
    • Sometimes uses DeepSeek-V3 to generate a potential CoT before answering; for trivial queries (e.g., “hello”) no CoT is provided (§2.3.3).
    • Total non-reasoning samples: ~200k (§2.3.3).
    • The paper then fine-tunes DeepSeek-V3-Base for 2 epochs on the combined ~800k dataset (§2.3.3).
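The rejection-sampling step in Stage 3 can be sketched as below. The `generate` and `is_accepted` callables are hypothetical stand-ins for the RL checkpoint and for the acceptance criteria (correctness plus readability filters); the toy versions exist only to make the sketch runnable and say nothing about the real models.

```python
import random


def rejection_sample(prompt, generate, is_accepted, num_samples=4):
    """Sample several candidates for one prompt and keep only the accepted ones."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    return [c for c in candidates if is_accepted(prompt, c)]


def build_sft_dataset(prompts, generate, is_accepted, num_samples=4):
    """Collect accepted (prompt, response) pairs across all prompts."""
    dataset = []
    for prompt in prompts:
        for response in rejection_sample(prompt, generate, is_accepted, num_samples):
            dataset.append({"prompt": prompt, "response": response})
    return dataset


# Toy stand-ins: a "model" that guesses 3, 4, or 5, and a checker that accepts "4".
rng = random.Random(0)


def toy_generate(prompt):
    return str(rng.choice([3, 4, 5]))


def toy_accept(prompt, response):
    return response == "4"


data = build_sft_dataset(["What is 2+2?"], toy_generate, toy_accept, num_samples=8)
print(len(data), "accepted samples out of 8")
```

In the pipeline described above, the acceptance check would combine rule-based verification, the DeepSeek-V3 judge for tasks without a rule-based verifier, and the readability filters.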
  23. Stage 4: RL for all scenarios (helpfulness + harmlessness + reasoning) (§2.3.4).

    • For reasoning prompts, it continues using rule-based rewards as in R1-Zero (§2.3.4).
    • For general prompts, it uses reward models capturing human preferences, building on the DeepSeek-V3 pipeline with similar preference-pair distributions (§2.3.4).
    • A notable design choice is evaluation scope:
    • Helpfulness reward focuses exclusively on the final summary to avoid interfering with the underlying reasoning process (§2.3.4).
    • Harmlessness reward evaluates the entire response (reasoning + summary) to detect risky or harmful content (§2.3.4).
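The difference in evaluation scope between the two preference rewards can be illustrated as follows. The separator used to split reasoning from summary, the unweighted sum, and the reward-model callables are all assumptions for the sake of the example; the summary does not state how the two scores are combined.

```python
def split_response(response, sep="</think>"):
    """Split a response into (reasoning, summary). The separator is an assumption."""
    reasoning, found, summary = response.partition(sep)
    if not found:
        # No reasoning block present: treat the whole response as the summary.
        return "", response.strip()
    return reasoning + sep, summary.strip()


def general_prompt_reward(response, helpfulness_rm, harmlessness_rm):
    """Score helpfulness on the summary only and harmlessness on the full response."""
    _, summary = split_response(response)
    helpful = helpfulness_rm(summary)     # final summary only
    harmless = harmlessness_rm(response)  # reasoning + summary
    return helpful + harmless             # assumption: unweighted sum


# Toy reward models that simply record what text they were shown.
seen = {}
score = general_prompt_reward(
    "<think>draft reasoning</think> Final answer.",
    helpfulness_rm=lambda text: seen.setdefault("helpful", text) and 1.0,
    harmlessness_rm=lambda text: seen.setdefault("harmless", text) and 1.0,
)
print(seen["helpful"])  # Final answer.
```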
  24. Distillation into smaller models (SFT-only in this work) (§2.4):

  25. The paper directly fine-tunes several open-source base models on the same 800k samples curated with DeepSeek-R1 (§2.4).
  26. Listed base models include:
    • Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct (§2.4).
  27. For distilled models, it applies only SFT (no RL stage) to isolate the effect of distillation; it notes RL could further improve them but is left for the community (§2.4).

Core configurations and hyperparameters (only what is explicitly provided)

  • GRPO hyperparameters: the objective includes ε (clipping) and β (KL coefficient) (Eq. (1)), but the excerpt does not provide their numeric values, nor G (group size) (§2.2.1).
  • Generation/evaluation settings (§3, “Evaluation Setup”):
  • Maximum output length: 32,768 tokens per benchmark evaluation (§3).
  • Sampling-based evaluation:
    • Temperature: 0.6.
    • top-p: 0.95.
    • For each question, generate k responses, “typically between 4 and 64,” then compute pass@1 as the average correctness across samples (§3).
  • AIME 2024 also reports cons@64 (majority vote with 64 samples) (§3; Table 2, Table 5, Table 6).
  • Model size disclosure:
  • DeepSeek-R1 and DeepSeek-V3 are MoE with 671B total params and 37B activated params (Table 4).
  • Training compute / optimizer / batch size / LR schedule / context window / tokenizer / layer counts:
  • The provided excerpt does not include optimizer type/settings, learning rate schedule, batch size, number of layers, hidden size, attention heads, training tokens, hardware, or total RL steps for DeepSeek-R1 (it does mention “over 10K steps” for one RL-on-Qwen experiment in §4.1).
  • These details should therefore be treated as “not specified in the provided paper content.”

4. Key Insights and Innovations

  • (1) Pure RL can induce strong reasoning without SFT cold start (validated at scale).
  • What’s new: DeepSeek-R1-Zero applies RL directly on a base model without any supervised reasoning data (§1, §2.2).
  • Why it matters: It demonstrates large gains on reasoning benchmarks using only rule-based reward signals and a structured output format, e.g., AIME 2024 pass@1 15.6%→71.0% (§1; Figure 2).

  • (2) Group-based RL optimization (GRPO) as a practical alternative to critic-based RL.

  • What’s new: GRPO removes the need for a critic model by using within-group reward statistics as a baseline (§2.2.1, Eq. (3)).
  • Why it matters: The paper motivates this as saving RL training cost while still enabling steady performance improvements (Figure 2).

  • (3) Multi-stage “cold start → RL → data generation → SFT → RL” pipeline for usable reasoning models.

  • What’s new relative to R1-Zero: The four-stage pipeline targets readability, language consistency, and general capabilities while retaining reasoning strength (§2.3).
  • Why it matters: It directly addresses observed deployment-facing issues of pure RL (poor readability and language mixing) (§2.2.4, §2.3.1–§2.3.2).

  • (4) Distillation transfers reasoning patterns more effectively than RL on smaller models (in the paper’s comparisons).

  • What’s new: The paper shows SFT distillation from DeepSeek-R1 into smaller dense models yields large improvements and can outperform RL trained directly on a smaller base model (§2.4, §4.1).
  • Evidence of significance: Table 6 shows DeepSeek-R1-Distill-Qwen-32B strongly outperforming DeepSeek-R1-Zero-Qwen-32B trained with >10k RL steps on multiple benchmarks.

5. Experimental Analysis

Evaluation methodology (datasets, metrics, baselines, setup)

  • Benchmarks (§3):
  • Knowledge / QA: MMLU, MMLU-Redux, MMLU-Pro, GPQA Diamond, SimpleQA, FRAMES, plus several Chinese benchmarks (C-Eval, CMMLU, C-SimpleQA, etc.) (§3; Table 4 lists a subset).
  • Reasoning-heavy math: AIME 2024, MATH-500, CNMO 2024 (§3; Table 4).
  • Coding: LiveCodeBench (2024-08 to 2025-01), Codeforces, SWE-Bench Verified, Aider-Polyglot (§3; Table 4).
  • Open-ended generation judged by LLMs: AlpacaEval 2.0 and ArenaHard, using GPT-4-Turbo-1106 as judge; evaluation feeds only the final summary to reduce length bias (§3).

  • Metrics:

  • pass@1 for many tasks; for AIME also cons@64 majority-vote accuracy (§3; Table 2, Table 5, Table 6).
  • Codeforces: reported as percentile and rating (Table 4, Table 5).
  • SWE-Bench Verified: “Resolved” (Table 4).
  • AlpacaEval: “length-controlled winrate”; ArenaHard: win-rate with specified judge model (Table 4).

  • Baselines (§3, Table 4):

  • DeepSeek-V3, Claude-3.5-Sonnet-1022, GPT-4o-0513, OpenAI-o1-mini, OpenAI-o1-1217.
  • The paper notes o1-1217 performance is reported from “official reports” due to API access difficulty in mainland China (§3).

  • Evaluation decoding (§3):

  • Maximum generation length: 32,768 tokens.
  • To reduce variability and repetition under greedy decoding, the paper uses sampling with temperature 0.6 and top-p 0.95, generating k samples and computing pass@1 as the average correctness across samples (§3).
  • For AIME, it additionally computes cons@64 via majority vote (§3).
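The two metrics described above reduce to simple aggregations over the k sampled responses for a question. The sketch below assumes each response has already been graded (for pass@1) or reduced to a final answer string (for cons@k); the function names and the tie-breaking rule are assumptions.

```python
from collections import Counter


def pass_at_1(correct_flags):
    """Estimate pass@1 as the mean correctness over k sampled responses."""
    return sum(correct_flags) / len(correct_flags)


def cons_at_k(answers, reference):
    """Majority vote over k final answers: 1.0 if the most common answer is correct.

    Ties are broken in favor of the answer seen first (an assumption).
    """
    top_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if top_answer == reference else 0.0


# Illustrative values for a single question with k = 4 samples.
print(pass_at_1([True, False, True, True]))        # 0.75
print(cons_at_k(["42", "41", "42", "42"], "42"))   # 1.0
```

Benchmark-level scores are then the average of these per-question values. Majority voting can exceed pass@1 because a question counts as solved whenever the correct answer is the plurality, even if many individual samples are wrong.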

Main quantitative results (with specific numbers)

  • DeepSeek-R1-Zero vs o1 models (reasoning benchmarks) (Table 2):
  • AIME 2024: 71.0 pass@1, 86.7 cons@64.
  • MATH-500: 95.9 pass@1.
  • GPQA Diamond: 73.3 pass@1.
  • LiveCodeBench: 50.0 pass@1.
  • Codeforces rating: 1444.
  • Comparison points:

    • OpenAI-o1-0912: AIME pass@1 74.4, MATH-500 94.8, GPQA 77.3, LiveCodeBench 63.4, rating 1843.
    • OpenAI-o1-mini: AIME pass@1 63.6, MATH-500 90.0, GPQA 60.0, LiveCodeBench 53.8, rating 1820.
  • DeepSeek-R1 vs other representative models (Table 4):

  • Math:
    • AIME 2024 pass@1: 79.8 (vs o1-1217 79.2, o1-mini 63.6, DeepSeek-V3 39.2).
    • MATH-500 pass@1: 97.3 (vs o1-1217 96.4, o1-mini 90.0, DeepSeek-V3 90.2).
    • CNMO 2024 pass@1: 78.8 (no o1-1217 value shown in the table).
  • Coding:
    • LiveCodeBench pass@1-CoT: 65.9 (vs o1-1217 63.4, o1-mini 53.8).
    • Codeforces percentile: 96.3; rating 2029 (vs o1-1217 percentile 96.6, rating 2061).
    • SWE Verified resolved: 49.2 (vs o1-1217 48.9, DeepSeek-V3 42.0).
    • Aider-Polyglot acc: 53.3 (vs o1-1217 61.7).
  • Knowledge / QA:
    • MMLU pass@1: 90.8 (vs o1-1217 91.8, DeepSeek-V3 88.5).
    • GPQA Diamond pass@1: 71.5 (vs o1-1217 75.7, DeepSeek-V3 59.1).
    • SimpleQA correct: 30.1 (vs o1-1217 47.0, DeepSeek-V3 24.9).
    • FRAMES acc: 82.5 (vs DeepSeek-V3 73.3).
  • Open-ended eval:

    • AlpacaEval2.0 length-controlled winrate: 87.6 (vs DeepSeek-V3 70.0).
    • ArenaHard winrate: 92.3 (vs DeepSeek-V3 85.5).
  • Distilled models (Table 5):

  • DeepSeek-R1-Distill-Qwen-7B: AIME pass@1 55.5, MATH-500 92.8, GPQA 49.1, LiveCodeBench 37.6, Codeforces rating 1189.
  • DeepSeek-R1-Distill-Qwen-14B: AIME pass@1 69.7, MATH-500 93.9, GPQA 59.1, LiveCodeBench 53.1, rating 1481.
  • DeepSeek-R1-Distill-Qwen-32B: AIME pass@1 72.6, MATH-500 94.3, GPQA 62.1, LiveCodeBench 57.2, rating 1691.
  • Comparisons shown include QwQ-32B-Preview and OpenAI-o1-mini (Table 5).

  • Distillation vs RL on a 32B base (key evidence for the paper’s distillation claim) (Table 6, §4.1):

  • DeepSeek-R1-Zero-Qwen-32B (RL >10k steps) vs DeepSeek-R1-Distill-Qwen-32B:
    • AIME pass@1: 47.0 vs 72.6.
    • MATH-500 pass@1: 91.6 vs 94.3.
    • GPQA Diamond pass@1: 55.0 vs 62.1.
    • LiveCodeBench pass@1: 40.2 vs 57.2.

Do experiments support the claims?

  • Claim: Pure RL can strongly improve reasoning.
  • Supported by the reported AIME improvement trajectory and the large jump from 15.6% to 71.0% pass@1 for DeepSeek-R1-Zero (Figure 2, §2.2.4, §1), plus cross-benchmark comparisons in Table 2.
  • The evidence is strongest on tasks with verifiable correctness and rule-based rewards (math/code), matching the training reward design (§2.2.2).

  • Claim: DeepSeek-R1 reaches parity with OpenAI-o1-1217 on reasoning tasks.

  • Table 4 shows close performance on AIME 2024 (79.8 vs 79.2) and MATH-500 (97.3 vs 96.4), and near parity on Codeforces (2029 vs 2061 rating).
  • However, the paper also shows gaps on some benchmarks (e.g., GPQA 71.5 vs 75.7; SimpleQA 30.1 vs 47.0; Aider-Polyglot 53.3 vs 61.7) (Table 4).

  • Claim: Distillation is more effective than RL for small models (as tested here).

  • Table 6 directly supports this for Qwen-32B: RL-trained DeepSeek-R1-Zero-Qwen-32B underperforms the distilled DeepSeek-R1-Distill-Qwen-32B on all listed benchmarks (§4.1).

Ablations, failure cases, robustness checks

  • Language consistency reward ablation:
  • The paper mentions that adding language consistency reward slightly degrades performance but improves readability (§2.3.2). The excerpt does not include numeric ablation results.

  • Failure modes / unsuccessful approaches (§4.2):

  • The paper reports unsuccessful attempts using a PRM (Process Reward Model) at scale due to step definition difficulty, intermediate correctness labeling difficulty, and reward hacking/complexity.
  • It also reports difficulties scaling MCTS for token generation due to enormous search space and difficulty training a fine-grained value model.

6. Limitations and Trade-offs

  • Readability and language mixing as core trade-offs of pure RL (R1-Zero).
  • DeepSeek-R1-Zero exhibits poor readability and language mixing (§2.2.4).
  • Fixing language mixing via a language consistency reward can slightly reduce raw performance, indicating a usability–accuracy trade-off (§2.3.2).

  • Reliance on verifiable tasks for clean rule-based rewards.

  • The strongest RL signal comes from tasks where correctness can be checked reliably (math with deterministic answers; code with compilation/test cases) (§2.2.2, §2.3.2).
  • For broader “all scenarios” alignment, the pipeline must introduce preference reward models (§2.3.4), which reintroduces complexity the pure rule-based R1-Zero setting avoided (§2.2.2).

  • General capability gaps relative to DeepSeek-V3 in some areas.

  • The paper states DeepSeek-R1 currently falls short of DeepSeek-V3 on function calling, multi-turn, complex role-playing, and JSON output (§5).

  • Prompt sensitivity.

  • The paper observes that few-shot prompting degrades DeepSeek-R1 performance and recommends zero-shot with explicit formatting instructions (§5).

  • Software engineering RL is limited by evaluation latency.

  • The paper attributes modest improvements over DeepSeek-V3 on some software engineering benchmarks to limited RL data and long evaluation times, which slow RL iteration (§3.1 discussion; §5).

  • Safety alignment can reduce answer rates / accuracy on some benchmarks.

  • On Chinese SimpleQA, the paper attributes worse performance to refusals after safety RL and notes that without safety RL accuracy could exceed 70% (§3.1 discussion under Table 4). This is a clear alignment–coverage trade-off.

  • Missing reproducibility-critical training details in the provided excerpt.

  • The excerpt does not specify optimizer, learning-rate schedule, batch size, training tokens, hardware, or the numeric values of key RL hyperparameters (ε, β, G) (Eq. (1)–(3)). This limits the ability to independently reproduce training exactly from the provided content.

7. Implications and Future Directions

  • How this changes the landscape (based on the paper’s evidence).
  • The results suggest that large-scale RL with carefully chosen, mostly rule-based incentives can drive substantial improvements in reasoning behaviors (longer CoT, self-correction-like reflection) without supervised reasoning data in the initial stage (Figure 2, Figure 3, Table 3, Table 2).
  • The distillation results imply that once a strong reasoning model exists, SFT distillation becomes a practical path to distribute reasoning ability into smaller dense models (Table 5), and can outperform doing RL from scratch on those smaller models (Table 6, §4.1).

  • Follow-up research suggested by the paper’s findings.

  • Better multilingual consistency: the paper explicitly flags language mixing outside Chinese/English and proposes future work to address it (§5).
  • General capability integration: improve function calling, multi-turn interaction, role-playing, and structured outputs while leveraging long CoT (§5).
  • More efficient software-engineering RL: use rejection sampling on software engineering data or asynchronous evaluation to reduce RL wall-clock time (§5).
  • Community exploration of RL on distilled models: the paper notes RL on distilled models can yield further gains but is not included in this work (§3.2 discussion; §2.4).

  • Practical applications / downstream use cases (grounded in evaluated tasks).

  • Stronger performance on math (AIME, MATH-500) and competitive programming (Codeforces, LiveCodeBench) suggests use in education/tutoring for problem solving, algorithmic coding assistance, and STEM QA where correctness matters (Table 4).
  • Improved long-context QA on FRAMES suggests potential value in document analysis and “AI-driven search and data analysis tasks” as the paper notes (Table 4 discussion in §3.1).

  • Repro/Integration Guidance (from the paper’s explicit recommendations and setup).

  • Prompting: Prefer zero-shot prompts that directly describe the task and specify output format; avoid few-shot prompting, which the paper reports as harmful for DeepSeek-R1 (§5).
  • Evaluation / deployment decoding: The paper’s evaluation avoids greedy decoding for long-output reasoning due to repetition/variance and instead uses sampling (temperature 0.6, top-p 0.95) with multiple samples and optionally majority vote (§3). If you want stability for reasoning tasks, the paper’s own methodology suggests using multi-sample strategies rather than single greedy outputs.
  • When to prefer distillation vs RL (as supported by Table 6):
    • If you have access to a strong reasoning teacher and want a smaller model, the paper’s evidence indicates SFT distillation is more compute-efficient and yields better results than large-scale RL on the smaller base model (Table 6, §4.1).
    • If your goal is to push beyond the teacher’s capabilities, the paper argues that this may still require more powerful base models and larger-scale RL (§4.1).
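To make the decoding guidance above concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering with the paper's evaluation values (temperature 0.6, top-p 0.95) as defaults. It operates on a plain list of logits and is meant only to show what the two settings do; real inference stacks implement this on tensors, and the function names here are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=0.6, top_p=0.95):
    """Return {token_index: prob} over the smallest top set whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng, temperature=0.6, top_p=0.95):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


rng = random.Random(0)
print(sample_token([2.0, 1.0, 0.5, -3.0], rng))
```

A temperature below 1 sharpens the distribution and top-p removes the unlikely tail, while sampling still allows different reasoning paths across the multiple samples used for pass@1 and majority voting.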