Training Language Models to Self-Correct via Reinforcement Learning

🎯 Pitch

This paper introduces SCoRe, a two‑stage, multi‑turn reinforcement learning approach that trains a single large language model to self‑correct intrinsically, that is, to detect and fix its own reasoning or coding mistakes using only self‑generated data, with no external feedback or auxiliary models. By addressing the distribution shift and behavior collapse that hampered previous methods, SCoRe achieves the first robust, significant improvements in self‑correction on challenging benchmarks like MATH and HumanEval. This yields practical gains in reliability and autonomy, and is a step towards models that can audit and refine their own outputs, a meta‑reasoning ability the paper treats as essential for advanced AI.

1. Executive Summary

This paper introduces SCoRe, a two‑stage, multi‑turn reinforcement learning (RL) method that trains a single language model to detect and fix its own mistakes (“intrinsic self‑correction”) using only self‑generated data. On MATH and HumanEval, SCoRe delivers the first clearly positive self‑correction gains without external feedback or auxiliary models. On MATH it raises final‑turn accuracy to 64.4%, a gain of 4.4 percentage points from turn 1 to turn 2 (Table 2), and on HumanEval it achieves a gain of 12.2 percentage points (Table 3).

2. Context and Motivation

Problem addressed
Intrinsic self‑correction: improving an answer in a second pass without any external feedback (no labels, tools, or teacher model at test time). Section 1 and Figure 2 show that contemporary LLMs rarely improve their own answers; they often either make superficial changes or turn correct answers into incorrect ones.
Why it matters
Practical: Many math and coding tasks require catching an earlier mistake and revising reasoning or code. Enabling a single model to reliably improve its own output can reduce the need for multiple model calls, human feedback, or external verifiers at inference time.
Scientific: Self‑correction is a concrete instance of meta‑reasoning—using test‑time computation to implement strategies like “audit and revise.” Section 1 argues current models often contain the needed knowledge but fail to elicit it during revision.
Prior approaches and their limits (Sections 2 and 4)
Prompting for self‑correction often fails in the intrinsic setting (no oracle signals), and can even hurt performance.
Supervised fine‑tuning (SFT) on revision traces—e.g., STaR or pairing incorrect with correct answers—improves first‑turn accuracy but yields little or negative improvement from turn 1 to turn 2. Table 1 shows small or negative deltas Δ(t1, t2) on MATH even after fine‑tuning.
Multi‑model systems (separate “refiner” or “teacher” models) work but complicate deployment. This paper aims for a single‑model solution trained only on its own rollouts.
The core gaps identified (Sections 4 and 5)
Distribution shift: Offline SFT learns to fix base‑model mistakes but fails when faced with its own different mistakes at test time (Figure 5).
Behavior collapse: The learner converges to “give best first answer and don’t change it,” producing little or no improvement on the second attempt (Figures 4 and 6).

3. Technical Approach

SCoRe (Self‑Correction via Reinforcement Learning) trains a single policy over two sequential attempts per problem.

Notation (Section 3):
- Accuracy@t1 and Accuracy@t2: correctness at the first and second attempt, respectively.
- Δ(t1, t2): improvement from attempt 1 to attempt 2.
- Δ_i→c: fraction of problems that go from incorrect to correct across attempts.
- Δ_c→i: fraction of problems that flip from correct to incorrect.
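The notation above can be made concrete with a small sketch. It assumes per-problem correctness flags for both attempts are already available, and that Δ_i→c and Δ_c→i are both measured as fractions of all problems; under that assumption, Δ(t1, t2) = Δ_i→c − Δ_c→i. The function name, dictionary keys, and toy data are illustrative and do not come from the paper.

```python
from typing import Dict, Sequence


def self_correction_metrics(
    correct_t1: Sequence[bool], correct_t2: Sequence[bool]
) -> Dict[str, float]:
    """Two-attempt self-correction metrics, each as a fraction of all problems."""
    if len(correct_t1) != len(correct_t2) or len(correct_t1) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_t1)
    pairs = list(zip(correct_t1, correct_t2))
    acc_t1 = sum(correct_t1) / n
    acc_t2 = sum(correct_t2) / n
    # incorrect at attempt 1, correct at attempt 2
    i_to_c = sum((not a) and b for a, b in pairs) / n
    # correct at attempt 1, incorrect at attempt 2
    c_to_i = sum(a and (not b) for a, b in pairs) / n
    return {
        "acc_t1": acc_t1,
        "acc_t2": acc_t2,
        "delta": acc_t2 - acc_t1,
        "i_to_c": i_to_c,
        "c_to_i": c_to_i,
    }


# Toy data (hypothetical): five problems, two attempts each.
t1 = [True, False, False, True, False]
t2 = [True, True, False, False, True]
m = self_correction_metrics(t1, t2)
```

Reporting the two flip rates alongside Δ(t1, t2) is useful because the same net improvement can hide very different amounts of harmful correct-to-incorrect revisions.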

High‑level objective (Eq. 1)
- The model produces two responses per input. The training goal is to maximize the sum of rewards across both attempts, where reward is given only by an “oracle checker” during training (e.g., exact answer on MATH or pass/fail test cases for coding). No checker is used at inference.

Base RL machinery (Eq. 2)
- Policy gradient (REINFORCE) with a KL penalty to a fixed reference model (π_ref) to prevent the policy from drifting too far from a good prior. “On‑policy” means the model is trained on rollouts generated by the current policy, avoiding the offline distribution shift seen in SFT.
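To build intuition for REINFORCE with a KL penalty, here is a minimal toy sketch. It is not the paper's Eq. 2: the policy is a softmax over a handful of candidate answers to a single question rather than a language model, and the KL term is folded into the return as a single-sample estimate, β · (log π(y) − log π_ref(y)), which is an assumption of this sketch. All names and hyperparameter values are hypothetical.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_kl_step(
    logits: List[float],
    ref_logits: List[float],
    rewards: List[float],
    beta: float,
    lr: float,
    n_samples: int,
    rng: random.Random,
) -> List[float]:
    """One on-policy REINFORCE update with a KL penalty to a fixed reference."""
    probs = softmax(logits)
    ref_probs = softmax(ref_logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        # On-policy: sample from the current policy, not from a fixed dataset.
        y = rng.choices(range(len(logits)), weights=probs)[0]
        # KL-penalized return for the sampled answer.
        ret = rewards[y] - beta * (math.log(probs[y]) - math.log(ref_probs[y]))
        # Gradient of log softmax w.r.t. the logits is onehot(y) - probs.
        for k in range(len(logits)):
            indicator = 1.0 if k == y else 0.0
            grad[k] += ret * (indicator - probs[k]) / n_samples
    # Gradient ascent on the penalized expected reward.
    return [l + lr * g for l, g in zip(logits, grad)]


# Toy run (hypothetical values): three candidate answers, only index 1 is correct.
rng = random.Random(0)
ref_logits = [0.0, 0.0, 0.0]
logits = list(ref_logits)
for _ in range(200):
    logits = reinforce_kl_step(
        logits, ref_logits, rewards=[0.0, 1.0, 0.0],
        beta=0.1, lr=0.5, n_samples=32, rng=rng,
    )
final_probs = softmax(logits)
```

A larger `beta` keeps the policy closer to the reference distribution, and a smaller one lets it concentrate on the rewarded answer. That trade-off is what the two KL weights in the method control.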

Two‑stage training to avoid collapse (Figure 7)
1) Stage I: Decouple the two attempts
- Train only the second attempt to maximize reward while constraining the first attempt to remain close to the base model via a strong KL penalty on turn 1 (Eq. 3, β2 term).
- Intuition: If the model is free to make its first attempt as good as possible, it will learn to “solve in one shot,” leaving nothing to correct. By “freezing” first‑turn behavior near the base model, Stage I forces learning how to improve given imperfect first answers. Figure 6b shows this increases the frequency of proposing a different second answer (i.e., prevents “don’t change” collapse).

2) Stage II: Joint optimization with progress‑based reward shaping
- Now train both attempts with standard RL (Eq. 4) plus a “progress bonus” on turn 2: b̂(y2|y1, y*) = α · (r̂(y2) − r̂(y1)).
- Intuition: Reward shaping explicitly favors transitions that fix mistakes and heavily penalizes turning a correct answer into an incorrect one. This biases learning towards the self‑correction strategy rather than “solve once and copy.”
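The effect of the progress bonus is easiest to see by enumerating the four possible transitions under a binary reward. The sketch below adds the bonus α · (r̂(y2) − r̂(y1)) to the turn-2 reward. The function name and the value α = 2.0 are illustrative assumptions, not the paper's settings (those are in Table 5).

```python
def shaped_turn2_reward(r1: float, r2: float, alpha: float) -> float:
    """Turn-2 reward plus the progress bonus alpha * (r2 - r1)."""
    return r2 + alpha * (r2 - r1)


ALPHA = 2.0  # hypothetical value chosen only for illustration

# (first-attempt reward, second-attempt reward) for each transition type
transitions = {
    "incorrect -> correct": (0.0, 1.0),
    "correct -> correct": (1.0, 1.0),
    "incorrect -> incorrect": (0.0, 0.0),
    "correct -> incorrect": (1.0, 0.0),
}

shaped = {
    name: shaped_turn2_reward(r1, r2, ALPHA)
    for name, (r1, r2) in transitions.items()
}

for name, value in shaped.items():
    print(f"{name:24s} shaped turn-2 reward = {value:+.1f}")
```

Without the bonus, "incorrect → correct" and "correct → correct" earn the same turn-2 reward, so the policy has no reason to prefer revising over copying. With any α > 0, fixing a mistake earns strictly more and breaking a correct answer becomes strictly worse than leaving a wrong answer unchanged.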

Additional implementation details (Section 5.3; Appendix B)
- On‑policy sampling is primary, with optional inclusion of base‑model first attempts to broaden state coverage.
- Hyperparameters include KL weights (β1, β2) and the progress multiplier α (Table 5).
- REINFORCE with KL to a reference model; greedy decoding for evaluation except in inference‑compute scaling (Section 6.2).

Why this design over alternatives (Sections 4–5)
- Offline SFT fails due to distribution shift and collapse (Table 1, Figures 4–5). On‑policy RL fixes the shift but still collapses (Figure 6) unless the two stages are used.
- Reward discounting alone (γ > 0) does not prevent collapse (Appendix A.2, Figure 9). The explicit progress bonus is needed.

A simple example to build intuition
- Suppose turn 1 answers “21” to a math question whose true answer is “24.” Stage I teaches the model to notice and fix such near misses while keeping turn‑1 behavior close to the base model. Stage II then reinforces the specific transition “21 → 24,” granting extra reward for correcting and penalizing “24 → 21.”

4. Key Insights and Innovations

Two‑stage RL to learn the self‑correction “meta‑strategy” (Sections 5.1–5.2; Figure 7)
Novelty: Trains the model to improve over its own prior attempt by explicitly decoupling attempts (Stage I) and then jointly optimizing with a progress‑based bonus (Stage II).
Significance: Avoids the twin pitfalls identified in Section 4—distribution shift and behavior collapse—where both SFT and naïve multi‑turn RL fail (Figures 5–6; Table 1).
Progress‑based reward shaping that directly encodes “fix mistakes” (Section 5.2)
Different from prior RLHF style objectives that only reward final correctness. The bonus α·(r2 − r1) encourages “incorrect → correct” flips and discourages “correct → incorrect.”
Impact: Improves Δ(t1, t2) substantially, as seen in ablations (Table 4, “w/o reward shaping”).
Evidence‑backed diagnosis of SFT failure modes for self‑correction (Section 4)
New analyses show SFT models learn to make minimal edits (Figure 4a) and generalize poorly from fixed offline traces to self‑generated first attempts (Figure 5).
Compute‑efficient inference strategy: mix parallel sampling with sequential self‑correction (Section 6.2; Figure 1 right)
Observation: For a fixed sampling budget, drawing fewer parallel samples and then self‑correcting each one yields higher accuracy under majority voting (self‑consistency) than spending the whole budget on parallel samples.
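The budget accounting behind the inference strategy above can be sketched as follows. The `sample` and `revise` callables stand in for a model's first attempt and its self-correction pass; both are hypothetical, and the toy stubs are deliberately rigged so the example runs deterministically. The sketch illustrates how the two strategies spend the same number of generations, not how accurate either one is. Voting over the revised answers is an assumption of this sketch.

```python
import itertools
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def parallel_only(sample: Callable[[], str], budget: int) -> str:
    """Spend the whole budget on independent first attempts, then vote."""
    return majority_vote([sample() for _ in range(budget)])


def sample_then_correct(
    sample: Callable[[], str], revise: Callable[[str], str], budget: int
) -> str:
    """Spend half the budget on first attempts and half on one revision each."""
    first_attempts = [sample() for _ in range(budget // 2)]
    revised = [revise(answer) for answer in first_attempts]
    return majority_vote(revised)


def make_sampler(answers: List[str]) -> Callable[[], str]:
    """Toy stand-in for a model: cycles through a fixed list of answers."""
    it = itertools.cycle(answers)
    return lambda: next(it)


def toy_revise(answer: str) -> str:
    """Toy stand-in for self-correction: fixes the near miss '21' to '24'."""
    return "24" if answer == "21" else answer


toy_answers = ["21", "24", "21", "21"]  # hypothetical first-attempt outputs
parallel_result = parallel_only(make_sampler(toy_answers), budget=4)
sequential_result = sample_then_correct(
    make_sampler(toy_answers), toy_revise, budget=4
)
```

Both calls use four generations in total. Whether the sequential split actually wins in practice depends on how reliable the revision step is, which is what the training method targets.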

These are more than incremental tweaks: Stage I + progress shaping change what is being learned (a meta‑strategy to improve) rather than only making first attempts better.

5. Experimental Analysis

Evaluation setup (Section 6)
Tasks and data:
- MATH (Hendrycks et al., 2021): reasoning with a verifier; train/test split follows Lightman et al. (2023). Main reporting on a 500‑problem test subset (“MATH500”).
- Coding: MBPP for training and HumanEval for evaluation; MBPP‑R (repair) for offline code‑fix tests (Table 3).
Models: Gemini 1.5 Flash (math) and Gemini 1.0 Pro (code). Greedy decoding at eval except Section 6.2.
Metrics (Section 3): Accuracy@t1, Accuracy@t2, Δ(t1, t2), Δ_i→c, Δ_c→i.
Baselines: Prompting‑based Self‑Refine; SFT variants: STaR and Pair‑SFT (Section 6).
Main quantitative results
MATH (Table 2; Figure 1 left):
- Base: Accuracy@t1 52.6%, Accuracy@t2 41.4%, Δ = −11.2% (self‑correction hurts).
- SCoRe: Accuracy@t1 60.0%, Accuracy@t2 64.4%, Δ = +4.4%, Δ_i→c = 5.8%, Δ_c→i = 1.4%.
- Compared to Pair‑SFT (Δ = +1.8%) and STaR+ (Δ = +0.4%), SCoRe shows a clearly positive and larger self‑correction gain while also raising both turn accuracies.
Coding (Table 3):
- MBPP‑R (repair): Base 47.3% → SCoRe 60.6%.
- HumanEval: Base Accuracy@t2 56.7%, Δ = +3.0%; SCoRe Accuracy@t2 64.6%, Δ = +12.2%.
- Pair‑SFT degrades self‑correction on HumanEval (Δ = −1.8%).
- This shows that training only on MBPP generalizes to HumanEval and that on‑policy training is key for live self‑correction (Pair‑SFT excels at static repair but not at self‑correction).
Do the experiments support the claims?
Yes, through targeted diagnostics and ablations:
- SFT failure modes: Table 1 (small/negative Δ), Figures 4–5 (minimal edits, distribution shift).
- Naïve multi‑turn RL collapses: Figure 6 (low frequency of changing the answer without Stage I).
- Necessity of each component: Table 4 shows that removing multi‑turn training, Stage I, or reward shaping reduces or reverses the self‑correction gain. Replacing Stage II RL with STaR also underperforms.
- Inference compute scaling: Figure 1 (right) shows that spending part of the sampling budget on sequential self‑correction is more effective than purely parallel sampling at the same total sample count.
- Additional robustness: Appendix A.1 shows SCoRe maintains mild gains beyond two attempts, while baselines do not improve after turn 2.
Qualitative behaviors (Figure 2; Appendix E)
SCoRe fixes arithmetic mistakes and reasoning errors, sometimes rewriting entire solutions or selectively repairing flawed steps, which aligns with the intended “audit and revise” behavior.

6. Limitations and Trade-offs

Assumptions and scope (Sections 3 and 7)
Requires a training‑time “oracle reward” (r̂) to check final correctness (e.g., exact answers or unit tests). No such oracle is used at inference, but training availability is assumed.
Trained and evaluated primarily for two attempts. Appendix A.1 shows slight gains beyond two, but robust multi‑turn scaling is untested.
Reward is mostly binary (correct/incorrect). Complex tasks where partial progress should be graded may need more nuanced reward design.
Computational and data considerations
RL fine‑tuning with on‑policy sampling is more compute‑intensive than SFT. However, the paper keeps comparable training budgets across methods (Section 6, “Models/Evaluation protocol”) and reports that RL achieves better self‑correction.
Hyperparameter sensitivity: SCoRe relies on several regularization weights (β1, β2) and the progress multiplier (α). The paper provides working values (Table 5) but not an extensive sensitivity analysis.
Generality and external validity
Results are on math and code tasks with reliable verifiers. Applicability to open‑ended tasks (e.g., long‑form writing) is unclear without designing suitable training rewards.
Models are Gemini variants; transfer to other model families or open‑weights models is not tested here.
Failure modes and edge cases
If turn‑1 outputs drift too far, Stage II may still risk collapse without strong enough decoupling or shaping. The method depends on balancing exploration (change answers) and stability (avoid c→i flips).

7. Implications and Future Directions

Field impact
Establishes that intrinsic self‑correction can be trained effectively with a single model and self‑generated data, provided training is on‑policy and explicitly rewards “making progress” across attempts. This reframes self‑correction as a multi‑turn RL problem with a meta‑strategy objective rather than a pure SFT problem.
Practical applications
Math solvers and code assistants that revise their own reasoning/code without external tools.
Compute‑efficient inference: combine a few diverse samples with sequential self‑correction for better accuracy at fixed budget (Figure 1 right).
Follow‑up research
Multi‑turn scaling: extend beyond two attempts with RL (Section 7 notes this as a limitation), potentially learning longer self‑improvement curricula.
Richer progress signals: step‑level or process‑based rewards (e.g., verifier on intermediate reasoning) rather than only final correctness.
Broader domains: apply SCoRe to tool use, planning, or long‑form generation by crafting domain‑appropriate training rewards.
Unification of stages: investigate single‑stage objectives that implicitly enforce decoupling and progress, reducing complexity (Section 7).
Robustness and safety: study when self‑correction wrongly overwrites correct answers and how to calibrate the trade‑off between Δ_i→c and Δ_c→i.

In short, SCoRe shows that teaching models “how to improve their own answers” benefits from RL on self‑generated traces plus carefully designed regularization that makes learning the meta‑strategy strictly preferable to the degenerate “don’t change” solution. The empirical evidence across math and coding (Tables 2–3, Figure 1) and the ablations (Table 4) make a strong case for this paradigm.