SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

🎯 Pitch

This paper empirically compares supervised fine-tuning (SFT) and reinforcement learning (RL) as post-training recipes for foundation models. Across two rule-based environments (GeneralPoints and V-IRL), RL yields out-of-distribution generalization while SFT tends to memorize the training rules. The finding matters because it identifies RL, especially with outcome-based rewards and multi-turn verification, as a practical route to robust, generalizable reasoning and visual capabilities, and clarifies that SFT’s primary value may be stabilizing output format for subsequent RL.

1. Executive Summary

This paper compares supervised fine-tuning (SFT) and reinforcement learning (RL) as post-training methods for foundation models, asking which one learns transferable rules and which one memorizes training patterns. Across two rule-based reasoning environments, GeneralPoints (24-point arithmetic) and V-IRL (visual navigation), the experiments show that RL transfers gains to out-of-distribution (OOD) rule and visual variants, while SFT often improves in-distribution (ID) performance but collapses on OOD variants (Figures 5–7; Sections 5.1–5.2). The work also argues that SFT still plays a practical role by stabilizing output format so RL training can function at all (Section 5.4; Figure 9).

2. Context and Motivation

What specific problem or gap does this paper address?
Post-training recipes for foundation models commonly include SFT and/or RL, but it is unclear which method yields generalization (learning principles that transfer) versus memorization (reproducing training examples/rules) in rule-based reasoning tasks (Introduction, Section 1).
The paper’s key framing is to “separate data memorization from the acquisition of transferable principles” and test whether post-training learns rules rather than overfitting to the training rule specification (Section 1).
Why is this problem important?
If a post-training method improves ID accuracy but fails under slight rule or perceptual changes, it undermines robustness and reliability of AI systems (Section 1).
The paper emphasizes generalization along two axes that are common failure modes for modern models:
- Textual rule generalization: applying instructions/rules to a modified rule variant (Section 1; Section 4).
- Visual generalization (for VLMs): maintaining performance under shifts like color/layout or new cities/landmarks (Section 1; Sections 4.1–4.2).
What prior approaches existed, and where do they fall short (as positioned here)?
SFT is widely used for instruction tuning and formatting behaviors (Related Works “Post-training”; the paper references LIMA as a “format teacher” conceptually).
RL is widely used for alignment or task optimization, but comparative understanding of generalization vs memorization across both LLM and VLM settings is limited (Section 2; Section 1).
How does this paper position itself relative to existing work?
It positions itself as a comparative study of SFT vs RL on both unimodal (LLM-style text) and multimodal (VLM with images) tasks, using controlled rule variants and visual variants to diagnose transfer (Sections 1, 2, 4, 5).
It builds on a multi-turn RL with verification paradigm similar to RL4VLM / sequential revision (Section 3; Figures 2–3).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a multi-turn training-and-evaluation loop where a foundation model proposes an answer/action, a verifier scores it and provides feedback, and the model can revise its response over several iterations (Figures 2–3; Section 3).
It solves rule-based reasoning and navigation by combining (i) SFT to produce structured outputs and instruction following, and (ii) RL (PPO) to optimize for outcome-based reward under the verifier, with the goal that the learned behavior transfers to unseen rule/visual variants (Sections 3 and 5).

3.2 Big-picture architecture (diagram in words)

Component 1: Task environment (GeneralPoints or V-IRL) → provides the current “state” (text-only or image+text) (Section 4; Figures 2 and 4).
Component 2: Policy model (Llama-3.2-Vision-11B) → generates an output action/answer in a required JSON format (Section 5; Figures 11–14).
Component 3: Verifier (VER) → checks legality/correctness, emits a scalar reward and textual feedback (Section 3; Appendix A.3 and B.3 reward specs).
Component 4: Sequential revision prompt constructor → appends past model outputs and verifier messages to form the next prompt (Section 3; Figures 2–3).
Component 5: Trainer
SFT trainer trains on single-turn expert prompt-response pairs (Appendix C.1).
RL trainer uses PPO to maximize verifier reward over multi-turn episodes (Section 3).

3.3 Roadmap for the deep dive

First, define how the paper maps RL concepts onto LLM/VLM generation with a verifier (Section 3).
Second, explain the multi-turn “sequential revision” state transition and why it matters for training and evaluation (Figures 2–3).
Third, detail each evaluation environment and the two kinds of OOD shifts (rule vs visual) used to test generalization (Section 4).
Fourth, describe the SFT and RL training pipelines and what “compute scaling” means in the plots (Appendix C.2–C.3).
Finally, connect these mechanics to the observed outcomes (Figures 5–10; Appendix D).

3.4 Detailed, sentence-based technical breakdown

Paper type and core idea.
This is an empirical comparative study of post-training methods. The core idea is to hold the backbone model and tasks fixed, vary whether learning happens through supervised imitation (SFT) or reward optimization with multi-turn verification (RL), and then test transfer to OOD rule and visual variants (Sections 1 and 5).
How RL is formulated for token-generating foundation models (mapping RL terms).
The paper defines the state space S as either pure text V^m (LLM case) or text+image V^m × O (VLM case), where V is the vocabulary and O is RGB image space (Section 3).
The action space A is an output token sequence V^n, i.e., the model’s generated response (Section 3).
A verifier VER: V^n → R × V^k produces (i) an outcome-based reward scalar r_t and (ii) textual verifier feedback v_ver^t after the model emits v_out^t (Section 3).
- “Outcome-based reward” here means the reward depends on whether the final produced answer/action is correct/valid (and sometimes partial failure categories), rather than supervising intermediate reasoning steps (Section 3; Appendix A.3, B.3).
Sequential revision: what happens first, second, third (information flow).
At time step t = 0, the model receives a system prompt v_in^0 that includes the task description, current state information (cards or navigation observation/history), rules, and a required JSON output schema (Figures 3, 11–14).
The model generates an output v_out^0 (e.g., a candidate equation for 24 points, or a navigation action) (Figures 2 and 13).
The verifier checks the output and returns (r_0, v_ver^0) such as “wrong calculation” / “illegal number used” / “correct answer” (Figure 2) or “Correct solution.” / “Incorrect action.” (Figures 13, 21).
For the next iteration, the new input prompt v_in^1 is formed by concatenating the original system prompt with the history of model outputs and verifier messages: v_in^t = concat(v_in^0, [v_out^k, v_ver^k]_{k=0}^{t-1}) (Section 3; Figures 2–3).
The generate, verify, and re-prompt steps repeat until either the verifier reports success or a maximum number of verification steps is reached (Appendix A.3 sets 5 steps for GeneralPoints; Appendix B.3 sets 2 for V-IRL).
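The information flow above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `build_prompt`, `run_episode`, `toy_policy`, and `toy_verifier` are hypothetical names, the newline join is an assumed way to realize `concat`, and the toy policy stands in for a real model call.

```python
from typing import Callable, List, Tuple

Policy = Callable[[str], str]                  # v_in -> v_out
Verifier = Callable[[str], Tuple[float, str]]  # v_out -> (reward, feedback text)


def build_prompt(system_prompt: str, history: List[Tuple[str, str]]) -> str:
    """v_in^t = concat(v_in^0, [v_out^k, v_ver^k] for k = 0..t-1)."""
    parts = [system_prompt]
    for v_out, v_ver in history:
        parts.extend([v_out, v_ver])
    return "\n".join(parts)  # joining with newlines is an assumption


def run_episode(policy: Policy, verifier: Verifier, system_prompt: str,
                max_steps: int, success_reward: float):
    """Generate, verify, and re-prompt until success or the step cap."""
    history: List[Tuple[str, str]] = []
    rewards: List[float] = []
    for _ in range(max_steps):
        v_in = build_prompt(system_prompt, history)
        v_out = policy(v_in)
        reward, v_ver = verifier(v_out)
        rewards.append(reward)
        history.append((v_out, v_ver))
        if reward == success_reward:
            break
    return rewards, history


def toy_policy(v_in: str) -> str:
    # Stand-in for the model: it "fixes" its answer after seeing feedback.
    return "(8-4)*(7-1)" if "wrong calculation" in v_in else "8+4+7+1"


def toy_verifier(v_out: str) -> Tuple[float, str]:
    # Stand-in for VER: accepts exactly one known-correct equation.
    if v_out == "(8-4)*(7-1)":
        return 5.0, "correct answer"
    return -1.0, "wrong calculation"


rewards, history = run_episode(toy_policy, toy_verifier, "SYSTEM PROMPT",
                               max_steps=5, success_reward=5.0)
```

In this toy run the first attempt is rejected, the verifier message is appended to the prompt, and the second attempt succeeds, so the episode ends after two of the five allowed steps.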
Training algorithms used.
The RL optimizer is PPO (Proximal Policy Optimization) (Section 3), treating the model π_θ as the policy.
The paper does not provide PPO-specific hyperparameters (e.g., clipping range, value loss coefficients, rollout lengths) in the provided content; only PPO is named (Section 3).
For SFT, the model is trained on supervised prompt-response pairs; the main-body SFT uses optimal single-turn expert responses and does not include multi-turn verification messages in the prompt (Appendix C.1).
An ablation adds “sub-optimal trajectories” that include errors and verifier messages during SFT, but it still does not improve OOD generalization (Appendix C.1; Figure 15).
Tasks and OOD designs (what generalization means operationally).
GeneralPoints (Section 4.1; Appendix A):
- State: 4 cards, either as text (GP-L) or as an image (GP-VL) (Section 4.1; Figure 2).
- Goal: output an equation equal to target (default 24) using each card’s numeric value exactly once (Section 4.1).
- Rule OOD shift: face cards J/Q/K are treated either as all 10 (ID training rule in Section 5.1) or as 11/12/13 (OOD test rule) (Section 4.1 “Rule variations”; Section 5.1).
- Visual OOD shift (VLM): train on black suits (♠, ♣) and test on red suits (♥, ♦) (Section 4.1 “Visual variations”; Section 5.2; Appendix A.3).
- Reward function: explicitly specified (Appendix A.3). For example:
- r = 5 for a legal equation equaling the target,
- r = -1 for a legal-but-wrong equation,
- penalties for illegal numbers/illegal equations,
- plus an extra r = -1.5 penalty in GeneralPoints-VL if card recognition is wrong (Appendix A.3).
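To make the rule shift and the outcome-based reward concrete, here is a hedged sketch of a text-only GeneralPoints verifier. The rewards of 5 and -1 come from the description above; the two illegal-output penalties are placeholder magnitudes chosen for illustration, treating an ace as 1 is an assumption, and all function and constant names are hypothetical. The GP-VL recognition penalty is not modeled.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

FACE_AS_TEN = {"J": 10, "Q": 10, "K": 10}        # ID rule used for training
FACE_AS_11_12_13 = {"J": 11, "Q": 12, "K": 13}   # OOD rule used for testing

# Placeholder magnitudes (assumption): the summary only says these are penalized.
ILLEGAL_NUMBER_PENALTY = -2.0
ILLEGAL_EQUATION_PENALTY = -3.0

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def card_values(cards, face_rule):
    """Map card labels to numbers; treating 'A' as 1 is an assumption."""
    return [face_rule[c] if c in face_rule else (1 if c == "A" else int(c))
            for c in cards]


def _evaluate(node):
    """Exactly evaluate an expression made of integers and + - * /."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("only integers, + - * / and parentheses are allowed")


def reward(equation, cards, face_rule, target=24):
    """Return (scalar reward, verifier feedback) for one proposed equation."""
    try:
        tree = ast.parse(equation, mode="eval")
        value = _evaluate(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return ILLEGAL_EQUATION_PENALTY, "illegal equation"
    used = Counter(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
    if used != Counter(card_values(cards, face_rule)):
        return ILLEGAL_NUMBER_PENALTY, "illegal number used"
    if value == target:
        return 5.0, "correct answer"
    return -1.0, "wrong calculation"


cards = ["K", "Q", "2", "2"]
id_result = reward("10+10+2+2", cards, FACE_AS_TEN)
ood_memorized = reward("10+10+2+2", cards, FACE_AS_11_12_13)
ood_adapted = reward("13+12-2/2", cards, FACE_AS_11_12_13)
```

The same hand shows why the rule shift is a meaningful test: an answer that is correct under the training rule uses illegal numbers under the test rule, so a model that has fixated on "face cards are 10" is penalized even though its arithmetic is right.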
V-IRL (Section 4.2; Appendix B):
- State: either a text-only description (V-IRL-L) or street-view image grid + text history (V-IRL-VL) (Section 4.2; Figures 13–14).
- Goal: follow route instructions to navigate to a destination; the model outputs a JSON containing “current observation”, selected “current instruction”, and an “action” (Figures 13–14).
- Rule OOD shift: switch the textual action space between:
- Absolute orientation: north, northeast, ... (ID)
- Relative orientation: left/right/slightly left/slightly right (OOD) (Section 4.2 “Rule variations”; Section 5.1; Figure 1 caption).
- Visual OOD shift: train on New York City routes and test on routes across multiple cities in the VLN mini benchmark (Section 5.2; Appendix B.1).
- Reward function: per-step reward r = 1 for correct action, r = -1 for wrong action, r = -1 for exceeding max verification step, and r = -1.5 for failed landmark detection (Appendix B.3).
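The V-IRL per-step reward listed above translates directly into a small function. This is only a sketch: the function name and arguments are hypothetical, and the order in which the conditions are checked is an assumption, since the summary does not say which penalty applies when several conditions hold at once.

```python
def virl_step_reward(predicted_action: str, expert_action: str,
                     landmark_detected: bool = True,
                     exceeded_max_steps: bool = False) -> float:
    """Per-step V-IRL reward using the values quoted from Appendix B.3."""
    if not landmark_detected:      # failed landmark detection
        return -1.5
    if exceeded_max_steps:         # ran past the verification-step cap
        return -1.0
    return 1.0 if predicted_action == expert_action else -1.0


# Under the absolute rule the expert label might be "north"; under the relative
# rule the same move is labeled "left", so a memorized "north" is scored as wrong.
absolute_ok = virl_step_reward("north", "north")
relative_miss = virl_step_reward("north", "left")
```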
Backbone model and hardware.
Backbone model: Llama-3.2-Vision-11B (Section 5).
Hardware: all training runs are on 8× NVIDIA H800 (80GB) in a single machine (Appendix C.2).
The paper does not specify several commonly needed model/training hyperparameters in the provided content, including:
- tokenizer details, context window length,
- number of layers / hidden size / attention heads (beyond “11B” parameter scale in the model name),
- SFT batch size, optimizer type (e.g., AdamW) and its β/ε, weight decay,
- learning-rate schedules for SFT or PPO,
- exact RL rollout lengths or KL penalties.
  Where learning rates are mentioned, they appear in ablations (Appendix D.1) rather than as a single canonical recipe.
Compute measurement and scaling methodology (what GFLOPs on plots means).
The plots scale training compute in GFLOPs and the paper estimates FLOPs using:
- X_train = 6 N D_train and X_inference = 2 N D_inference (Appendix C.3),
- with RL adding an extra inference term for on-policy buffer collection: X_RL = 6N(D_init + D_RL) + 2N D_buffer, and the buffer approximated by D_buffer ≈ λ D_RL (Appendix C.3).
The paper estimates λ ≈ 6 for GeneralPoints and λ ≈ 5.1 for V-IRL (Appendix C.3).
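The compute formulas above can be written out as a small calculator. This is a sketch under the stated approximations; the function names are hypothetical, and `n_params` and the token counts are whatever the reader plugs in, not values taken from the paper.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """X_train = 6 * N * D_train."""
    return 6 * n_params * n_tokens


def inference_flops(n_params: float, n_tokens: float) -> float:
    """X_inference = 2 * N * D_inference."""
    return 2 * n_params * n_tokens


def rl_flops(n_params: float, d_init: float, d_rl: float, lam: float) -> float:
    """X_RL = 6N(D_init + D_RL) + 2N * D_buffer, with D_buffer ~= lam * D_RL."""
    d_buffer = lam * d_rl
    return train_flops(n_params, d_init + d_rl) + inference_flops(n_params, d_buffer)


def rl_to_sft_token_cost_ratio(lam: float) -> float:
    """Cost of one RL training token relative to one SFT token: (6 + 2*lam) / 6."""
    return (6 + 2 * lam) / 6


gp_ratio = rl_to_sft_token_cost_ratio(6.0)    # GeneralPoints estimate of lambda
virl_ratio = rl_to_sft_token_cost_ratio(5.1)  # V-IRL estimate of lambda
```

Under these estimates, each RL training token costs about 3 times an SFT token for GeneralPoints and about 2.7 times for V-IRL, which is why the plots compare the two methods at matched compute instead of matched training tokens.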
Line plots are smoothed with a Savitzky–Golay filter (polynomial order 3) and binomial standard error bars are approximated as sqrt(P(1-P)/N) (Appendix C.3).
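The error-bar approximation is simple enough to check by hand. The sketch below implements only the binomial standard error; the function name is hypothetical. The smoothing step is not reproduced here (SciPy's `scipy.signal.savgol_filter` is one available implementation, but the window length used in the paper is not given in this summary).

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Approximate standard error of a success rate p measured on n trials."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return math.sqrt(p * (1.0 - p) / n)


# Illustrative only: a 77.8% success rate, if measured on just 18 trials.
small_benchmark_se = binomial_standard_error(0.778, 18)
```

With N as small as 18, the standard error is close to 10 percentage points, so differences on small benchmarks are best read with the error bars in mind.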

4. Key Insights and Innovations

(1) RL transfers to unseen rules while SFT collapses on rule OOD.
The central finding is that RL improvements on ID rules tend to carry over to OOD rule variants in both text-only and vision-language settings, whereas SFT often overfits to the training rule specification (Section 5.1; Figures 5–6; Figure 1 as an illustrative navigation case).
What’s novel here is not PPO itself, but the controlled rule-variant evaluation across both arithmetic and navigation tasks under the same multi-turn verifier framework.
(2) RL also improves visual OOD generalization, not just textual rule generalization.
In the VLM settings, RL generalizes to visual distribution shifts (card suit color in GP-VL, new cities in V-IRL-VL), while SFT performance decreases (Section 5.2; Figure 7).
(3) Evidence that outcome-reward RL can improve visual recognition capability.
The paper measures card recognition accuracy in GP-VL and finds that scaling RL compute improves recognition accuracy alongside success rate, whereas scaling SFT degrades both (Section 5.3; Figure 8).
This is an important mechanistic claim: RL isn’t only learning a better “reasoning template”; it appears to improve an upstream perceptual subskill that bottlenecks task success (Section 5.3).
(4) SFT remains necessary as an “output format / instruction-following enabler” for RL.
Direct end-to-end RL from the base model fails because the model does not follow the required structured output, making reward extraction impossible (Section 5.4; Figure 9; failure example Figure 20).
This nuance is a key contribution to practical training design: even if RL generalizes better, SFT may be required as scaffolding.
(5) Verification iterations (inference-time compute) materially affect RL generalization.
Increasing the maximum number of verifier-guided revision steps yields larger OOD gains under a fixed compute budget (Section 5.5; Figure 10), tying the paper’s results to “inference-time compute scaling” ideas in the multi-turn RL setting.

5. Experimental Analysis

Evaluation methodology: datasets / environments / splits.
Two environments (Section 4):
- GeneralPoints: sampled 4-card hands from a standard 52-card deck, guaranteed solvable for target 24 via an expert solver (Appendix A.1). Variants:
- GP-L (text) and GP-VL (image).
- Rule shift: face cards as 10 (ID) vs 11/12/13 (OOD) (Sections 4.1, 5.1).
- Visual shift: black suits train vs red suits test (Sections 4.1, 5.2).
- V-IRL: navigation routes.
- Training database: 1000 unique routes from New York City (Appendix B.1).
- Visual OOD benchmark: 18 routes across 9 cities (2 routes/city), using the VLN mini benchmark (Appendix B.1).
- Rule shift: absolute vs relative action space (Section 4.2; Section 5.1).
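The "guaranteed solvable" condition for GeneralPoints hands can be illustrated with a brute-force check. This is not the paper's expert solver, only a minimal sketch of one way to test solvability: repeatedly combine any two remaining numbers with +, -, *, or / using exact fractions. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def _search(nums, target):
    """True if the numbers can be combined with + - * / to reach target."""
    if len(nums) == 1:
        return nums[0] == target
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
        candidates = {a + b, a - b, b - a, a * b}
        if b != 0:
            candidates.add(a / b)
        if a != 0:
            candidates.add(b / a)
        if any(_search(rest + [c], target) for c in candidates):
            return True
    return False


def solvable(values, target=24):
    """Check whether a hand (already mapped to numbers) can reach the target."""
    return _search([Fraction(v) for v in values], Fraction(target))


hand_id = solvable([10, 10, 2, 2])    # K, Q, 2, 2 with face cards as 10
hand_ood = solvable([13, 12, 2, 2])   # same hand with face cards as 11/12/13
hard_hand = solvable([3, 3, 8, 8])    # needs fractions: 8 / (3 - 8/3)
unsolvable = solvable([1, 1, 1, 1])
```

Exact fractions matter here: hands such as 3, 3, 8, 8 are only solvable through a non-integer intermediate value, which floating-point equality checks could miss.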
Post-training backbone: Llama-3.2-Vision-11B (Section 5).
Pipeline: initialize with SFT, then separately scale compute for SFT and RL starting from that same initialized checkpoint (Section 5; Appendix C.2).
Metrics (what is reported where).
GeneralPoints: episode success rate (%) (Appendix C.3; Figures 5–7).
- Success is defined as succeeding at least once during inference-time verification (Appendix C.3).
V-IRL: two metrics appear:
- Per-step accuracy for V-IRL-VL in Figures 5–6 (Appendix C.3).
- Overall success rate (entire route correct) is shown separately (Appendix D.2; Figure 18) and is very low due to compounding errors.
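The gap between per-step accuracy and whole-route success follows from compounding. As a rough sketch, assume each step is correct independently with the same probability (a simplifying assumption, not a claim from the paper); then route success is the per-step accuracy raised to the route length. The function name and the example numbers are illustrative.

```python
def route_success_probability(per_step_accuracy: float, n_steps: int) -> float:
    """Probability of getting every step right, assuming independent steps."""
    return per_step_accuracy ** n_steps


# Hypothetical example: 90% per-step accuracy on a 20-step route.
example = route_success_probability(0.9, 20)
```

Even at 90% per-step accuracy, a hypothetical 20-step route would be completed only about 12% of the time under this assumption, which is why full-route success can stay very low while per-step accuracy improves.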
Main quantitative results: rule OOD (RL up, SFT down).
Figure 6 (OOD under rule variants) gives explicit numbers comparing Init vs SFT vs RL (Section 5.1 summarizes them):
- GP-L OOD: RL improves 11.5% → 15.0% (+3.5%), while SFT drops 11.5% → 3.4% (−8.1%) (Figure 6; Section 5.1).
- V-IRL-L OOD: RL improves 80.8% → 91.8% (+11.0%), while SFT drops 80.8% → 1.3% (−79.5%) (Figure 6; Section 5.1).
- GP-VL OOD: RL improves 11.2% → 14.2% (+3.0%), while SFT drops 11.2% → 5.6% (−5.6%) (Figure 6; Section 5.1).
- V-IRL-VL OOD (per-step accuracy): RL improves 35.7% → 45.0% (+9.3%), while SFT drops 35.7% → 2.5% (−33.2%) (Figure 6; Section 5.1).
Figure 5 supports the broader trend across compute scaling: RL curves improve OOD across all four task variants, while SFT OOD degrades as compute increases (Figure 5, bottom row; Section 5.1).
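The Figure 6 deltas quoted above can be recomputed from the reported start and end points as a quick consistency check. The numbers are copied from this summary; the dictionary and function names are hypothetical.

```python
# (init, after RL, after SFT) OOD performance in percent, from the list above.
FIGURE_6 = {
    "GP-L": (11.5, 15.0, 3.4),
    "V-IRL-L": (80.8, 91.8, 1.3),
    "GP-VL": (11.2, 14.2, 5.6),
    "V-IRL-VL": (35.7, 45.0, 2.5),
}


def deltas(init: float, rl: float, sft: float):
    """Return (RL change, SFT change) in percentage points, rounded to 0.1."""
    return round(rl - init, 1), round(sft - init, 1)


changes = {task: deltas(*values) for task, values in FIGURE_6.items()}
sign_flip_everywhere = all(rl > 0 > sft for rl, sft in changes.values())
```

All four task variants show the same pattern: a positive change for RL and a negative change for SFT from the shared initial checkpoint.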
Main quantitative results: visual OOD (RL up, SFT down).
Figure 7 reports visual OOD numbers:
- GP-VL visual OOD: RL improves 23.6% → 41.2% (+17.6%), while SFT decreases 23.6% → 13.7% (−9.9%) (Figure 7; Section 5.2).
- V-IRL-VL visual OOD (VLN mini benchmark success rate): RL improves 16.7% → 77.8% (+61.1%), while SFT decreases 16.7% → 11.1% (−5.6%) (Figure 7; Section 5.2).
The paper highlights this V-IRL result as a “byproduct” SOTA improvement on the mini benchmark: 44.0% → 77.8% (+33.8%) (Section 5.2).
- The comparison baseline “44.0%” is attributed to prior V-IRL work (Yang et al. 2024a) and is marked as “Previous SOTA” in Figure 7; the paper notes that its own approach is end-to-end RL on an open-source model, in contrast to the prior two-stage pipeline built on a closed model (Section 5.2).
Ablations / diagnostic experiments.
Visual recognition vs success in GP-VL: Figure 8 shows recognition accuracy (y-axis) vs success rate (x-axis) for both ID and OOD, with RL trending upward in both and SFT trending downward (Section 5.3).
SFT without multi-turn data vs SFT with sub-optimal multi-turn trajectories:
- Appendix C.1 and Figure 15 show that even when SFT is trained on trajectories that resemble evaluation prompts (including errors and verifier messages), OOD still degrades, supporting the claim that memorization is not just a data-format mismatch.
RL without SFT initialization:
- Section 5.4 and Figure 9 show no improvement across trials; Figure 20 shows an example where outputs become long/unstructured, breaking reward extraction.
Verification-iteration scaling:
- Section 5.5 and Figure 10 show that more verification iterations {1,3,5,10} produce larger OOD growth; under fixed compute, the paper reports OOD improvements of +0.48% (1 step), +2.15% (3), +2.99% (5), +5.99% (10) (Section 5.5; Figure 10).
Do the experiments support the claims?
The claim “RL generalizes, SFT memorizes” is supported in the paper’s operational sense by consistent sign flips between RL and SFT on OOD variants across:
- two tasks (GeneralPoints, V-IRL),
- two modalities (L, VL),
- two OOD types (rule shift, visual shift) (Figures 5–7).
However, the mechanism-level claim “SFT memorizes” is demonstrated behaviorally (OOD collapse) rather than via a direct memorization metric (e.g., retrieval of near-exact training examples). The paper defines memorization as generating near-exact copies of training examples when prompted (Section 1 footnote), but the provided content primarily shows performance collapse and rule fixation rather than explicit train-example regurgitation tests.

6. Limitations and Trade-offs

Missing or underspecified training details (reproducibility limits).
While PPO is specified (Section 3) and compute estimation is detailed (Appendix C.3), the provided content does not include many standard hyperparameters for SFT and RL (batch size, optimizer configs, PPO clip/rollout params, tokenization/context length). This limits exact reproduction from the paper text alone.
Dependence on a verifier and structured outputs.
The entire RL approach assumes a reliable VER that can score correctness and provide feedback text (Section 3; Figures 2–3).
When the model fails to follow the JSON schema, RL becomes infeasible because rewards cannot be computed (Section 5.4; Figure 20).
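The dependence on structured output can be seen in a minimal reward-extraction sketch. The function name is hypothetical, and using "action" as the required key is an assumption based on the V-IRL output fields mentioned earlier; the point is only that unparseable output leaves the verifier with nothing to score.

```python
import json
from typing import Optional


def extract_action(model_output: str, required_key: str = "action") -> Optional[str]:
    """Return the field the verifier needs, or None if the output is unusable."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # free-form text: no reward can be computed
    if not isinstance(parsed, dict) or required_key not in parsed:
        return None  # valid JSON, but not the required schema
    return parsed[required_key]


well_formed = extract_action('{"current observation": "a cafe", "action": "north"}')
free_form = extract_action("Let me think about which way to turn...")
wrong_schema = extract_action('["north"]')
```

When every rollout falls into the `None` branch, as in the failure case without SFT initialization, the policy gets no informative reward signal to learn from.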
SFT failure on GP-VL (unresolved).
The paper flags an “abnormal phenomenon” where SFT fails to reach comparable ID performance to RL on GP-VL (Section 6).
Even after trying multiple learning rates and freezing variants (vision encoder, adapter), SFT does not show a strong increasing trend beyond ~30% success rate in the shown ablations (Figure 16; Section 6; Appendix D.1).
The paper hypothesizes SFT may “locally overfit to reasoning tokens while neglecting recognition tokens” due to token frequency imbalance, but leaves this as future work (Section 6).
RL has “corner case” failures: underfit or overfit initial checkpoints.
RL without SFT initialization fails because the base model does not follow instructions well enough to produce scorable outputs (Section 5.4; Figure 9).
RL initialized from an “extremely overfitted” SFT checkpoint cannot recover OOD performance (Section 6; Appendix D.3; Figure 19 and failure example Figure 21).
This implies RL is not a universal fix; it needs an initialization in a “workable regime” (Section 6).
Metric trade-offs in V-IRL.
Per-step accuracy can improve while full-route success remains extremely low due to compounding errors on multi-step trajectories (Appendix D.2; Figure 18). This complicates interpretation of practical navigation success versus local decision quality.

7. Implications and Future Directions

How this work changes the landscape (within the provided evidence).
For rule-based tasks with clear success criteria, the paper provides evidence that outcome-reward RL with multi-turn verification is a stronger path to transfer under rule/visual shifts than scaling SFT compute (Figures 5–7).
It also suggests a concrete mechanism for multimodal gains: RL can improve perceptual subskills (card recognition accuracy) even when only outcome reward is used (Section 5.3; Figure 8).
Follow-up research suggested by the paper’s results.
Diagnose why SFT degrades visual recognition and whether data balancing or multi-objective losses could prevent the hypothesized “reasoning-token overfit” (Section 6; Figure 8; Figure 16).
Delineate conditions where SFT helps RL versus harms it (underfit vs overfit initialization regimes), since both extremes break RL effectiveness (Sections 5.4 and 6; Figures 9 and 19).
Study verifier design and verification-step scaling more systematically, since generalization depends strongly on the number of revision iterations (Section 5.5; Figure 10).
Practical applications / downstream use cases.
Good fits for the paper’s RL-with-verifier approach are tasks where rewards can be computed automatically (Appendix A.3; Appendix B.3), such as tasks with:
- explicit correctness checking (math games like 24 points),
- executable constraints (legal equation checking),
- navigation with oracle trajectories / action checking.
Repro/Integration Guidance (when to use what, based on the paper).
Prefer SFT first when:
- the base model does not reliably follow task instructions or output schemas; SFT stabilizes format so RL can run (Section 5.4; Figure 9; Figure 20).
Prefer RL (multi-turn, outcome-reward, with verification) when:
- you care about OOD transfer across rule changes or visual shifts, since RL shows consistent OOD gains where SFT degrades (Sections 5.1–5.2; Figures 5–7).
Scale verification iterations at inference/training time if generalization is the target, since more iterations correlate with larger OOD improvements under fixed compute (Section 5.5; Figure 10).