Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models
ArXiv: 2603.13985
Pitch
This comprehensive survey unifies two dominant LLM post-training paradigms—Supervised Fine-Tuning and Reinforcement Learning—revealing they are deeply connected rather than disjoint methodologies. Through a systematic review of literature from 2023–2025, the authors document the field's rapid shift toward hybrid training pipelines and provide practitioners with principled guidance on when and how to leverage each approach for optimal performance.
1. Executive Summary
This survey paper provides the first systematic comparison and unification of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) as post-training paradigms for Large Language Models (LLMs). The authors demonstrate that SFT can be mathematically reformulated as a special case of RL through a gradient equivalence argument, and they comprehensively review methods from 2023–2025 showing a rapid field-wide shift toward hybrid SFT-RL pipelines (from 20% to ~74% of studies) and from proprietary API-based labeling to open-weight model-generated datasets. The paper organizes the literature through a taxonomy (Figure 1) covering algorithm-centric and data-centric innovations, unified training objectives, and applications across four domains (general QA, mathematical reasoning, agentic tasks, and code generation).
2. Context and Motivation
The Core Problem: SFT and RL Are Treated as Disjoint Methods
Pre-trained LLMs, from PaLM to LLaMA, require post-training adaptation to improve accuracy on specific tasks, mitigate erroneous outputs, and acquire new capabilities. Two dominant paradigms exist: Supervised Fine-Tuning (SFT), which trains on high-quality prompt-response pairs using standard language modeling objectives, and Reinforcement Learning (RL), which optimizes behaviors according to reward signals derived from human or automated preference feedback.
Despite both being essential post-training tools, prior work examined them separately. As the authors note:
"the majority of them typically examine SFT or RL separately... leaving the relationships between these approaches comparatively underexplored."
This fragmentation created several practical problems:
- Practitioners lacked principled guidance on when to use SFT versus RL versus both
- Techniques developed for one paradigm weren't being transferred to the other, despite potential applicability
- The emerging trend toward hybrid methods (combining SFT and RL) lacked a coherent theoretical foundation
Why This Problem Matters
The importance stems from both practical and theoretical dimensions:
Practical impact. Post-training determines whether an LLM succeeds or fails at real-world deployment. For example:
- Chain-of-Thought (CoT) reasoning requires post-training to generate progressively longer reasoning chains
- Agentic tasks (household robots, device control) require post-training for interaction skills rarely present in pre-training data
- Hallucination mitigation depends on post-training to recognize uncertainty
Theoretical significance. Understanding the relationship between SFT and RL opens the door to:
- Transferring stability techniques from SFT to RL (mitigating entropy collapse and reward hacking)
- Transferring exploration techniques from RL to SFT (addressing distribution shift and compounding errors)
- Designing hybrid methods that optimally balance imitation and exploration
Prior Approaches and Their Limitations
The paper identifies several prior research streams that each addressed only part of the picture:
SFT-only surveys (Parthasarathy et al., 2024; Tao et al., 2024; Mao et al., 2025) focused on fine-tuning techniques without considering RL alternatives. SFT, while stable and simple to implement, suffers from:
- Distribution shift: The model is trained on expert demonstrations but must perform on its own generated outputs at test time
- Compounding errors: Following the Behavior Cloning (BC) literature (Ross and Bagnell, 2010; De Haan et al., 2019), small mistakes early in generation cascade into larger failures
- Limited exploration: SFT cannot discover behaviors absent from the demonstration data
RL-only surveys (Tie et al., 2025; Zhang et al., 2025c) explored reward optimization without considering SFT foundations. RL, while enabling exploration, suffers from:
- Entropy collapse: The policy can become deterministic, losing diversity
- Reward hacking: The model finds high-reward behaviors that exploit the reward model rather than achieving the intended goal
- Sample inefficiency: Pure RL requires extensive exploration without expert guidance
Specialized surveys focused on narrow dimensions—vision-centric adaptation (Chu et al., 2025), reasoning (Kumar et al., 2025), agentic behaviors (Du et al., 2025a), or scaling strategies (Lai et al., 2025)—but lacked a unified perspective on SFT and RL together.
Sequential pipelines (the RLHF paradigm from Ouyang et al., 2022; Bai et al., 2022) applied SFT first to "inject prior general knowledge," then RL to "promote the capability in a particular aspect." However, as the authors note:
"there is a missing systematic comparison and combination between these two stages, especially from a methodological and theoretical perspective."
How This Paper Positions Itself
The paper positions itself as providing three contributions (explicitly stated in the Introduction):
- First systematic comparison: "We are the first study to systematically summarize and compare SFT and RL in LLM post-training"
- Unified framework: "establish a unified framework for characterizing SFT and RL, highlighting how they can complement each other or be integrated into hybrid learning approaches"
- Trend analysis: "observe rapid task-domain expansion, growing adoption of integrated SFT–RL training, and a continued shift from API-based labeling to open-weight–generated datasets"
The paper uses a taxonomy (Figure 1) organizing prior work into three categories:
- Algorithm-centric versus data-centric approaches within SFT and RL
- Comparative, unifying, and hybrid frameworks that integrate SFT and RL objectives
- Representative downstream application domains
This positions the work as a "horizontal" survey that spans across the field, rather than a "vertical" deep-dive into any single method.
3. Technical Approach
3.1 Reader Orientation
This is a survey paper that provides a systematic review and theoretical unification of SFT and RL post-training methods, rather than proposing a single new algorithm. The core technical contribution is demonstrating that SFT can be viewed as a special case of RL through a gradient equivalence argument, enabling techniques to transfer between paradigms.
3.2 Big-Picture Architecture
The paper's analytical framework consists of four major components:
- Taxonomy of Methods (Figure 1): Organizes existing literature into algorithm-centric vs. data-centric approaches for both SFT and RL, then into unifying frameworks and hybrid training methods.
- Unified Objective Framework (Section 4.1): A mathematical reformulation showing that both SFT and RL optimize variants of the same underlying objective: maximizing expected reward with KL regularization.
- Cross-Enhancement Analysis (Sections 4.2–4.3): Methods that use SFT to improve RL (e.g., providing demonstration data, modifying objectives) and vice versa (e.g., applying RL perspective to rescale SFT loss).
- Application Domain Review (Section 5): Four domains (general QA, mathematical tasks, agentic tasks, and code-based tasks), each with domain-specific challenges and methodological adaptations.
3.3 Roadmap for the Deep Dive
We explain the technical content in the following order:
- First, the mathematical foundations of SFT and RL objectives separately, establishing notation.
- Second, the unified objective framework showing how SFT can be viewed as RL with an indicator reward.
- Third, algorithm-centric innovations in SFT (entropy regularization, token cleaning, policy-gradient variants).
- Fourth, algorithm-centric innovations in RL (policy optimization variants, entropy regularization strategies).
- Fifth, data-centric innovations in both paradigms (selection strategies, curriculum learning).
- Sixth, hybrid methods that combine SFT and RL through unified objectives or interleaved training.
- Seventh, the application domains and their domain-specific methodological patterns.
3.4 Detailed, Sentence-Based Technical Breakdown
Mathematical Foundations: SFT and RL Objectives
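Supervised Fine-Tuning (SFT) minimizes the negative log-likelihood of target responses given prompts. Given a dataset \(D = \{(x, y)\}\) of prompt-response pairs, SFT optimizes:
\[ \mathcal{L}_{SFT}(\theta) = -\mathbb{E}_{(x, y) \sim D}\left[\log \pi_\theta(y | x)\right] \]
where \(\pi_\theta\) is the language model (policy) parameterized by \(\theta\). The gradient is:
\[ \nabla_\theta \mathcal{L}_{SFT}(\theta) = -\mathbb{E}_{(x, y) \sim D}\left[\nabla_\theta \log \pi_\theta(y | x)\right] \]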
This is standard maximum likelihood estimation: push probability mass toward expert demonstrations. The authors explicitly connect this to Behavior Cloning (BC) from the RL literature (Pomerleau, 1991), noting that "both frameworks learn directly from expert demonstrations without relying on explicit reward signals."
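Reinforcement Learning (RL) maximizes expected reward from a reward function \(r\). The objective is:
\[ J(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot | x)}\left[r(x, y)\right] \]
The gradient uses the policy gradient theorem:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot | x)}\left[r(x, y)\, \nabla_\theta \log \pi_\theta(y | x)\right] \]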
This is the REINFORCE estimator: increase the probability of actions that receive high reward. The key distinction is that SFT trains on fixed expert data \((x, y) \sim D\), while RL trains on samples \(y \sim \pi_\theta\) from the current policy.
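To make the contrast concrete, here is a minimal sketch on a toy softmax policy over a handful of whole responses. The setup is an assumption for illustration and is not taken from the paper: `logits` stands in for \(\theta\), the helper names are hypothetical, and the expectation over \(y \sim \pi_\theta\) is summed exactly, which is only feasible because the toy output space is tiny.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a small list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, y):
    """Gradient of log pi(y) w.r.t. the logits: onehot(y) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]

def sft_loss_grad(logits, expert_y):
    """Gradient of the SFT loss -log pi(expert_y) for one fixed expert response."""
    return [-g for g in grad_log_prob(logits, expert_y)]

def reinforce_grad(logits, reward):
    """Exact policy gradient of J = E_{y~pi}[r(y)] = sum_y pi(y) r(y) grad log pi(y)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, p in enumerate(probs):
        g = grad_log_prob(logits, y)
        for k in range(len(logits)):
            grad[k] += p * reward[y] * g[k]
    return grad

if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]                         # hypothetical 3-response policy
    print(sft_loss_grad(logits, expert_y=1))         # descend this to imitate response 1
    print(reinforce_grad(logits, [0.0, 1.0, 0.0]))   # ascend this to maximize reward
```

In this toy case both updates raise the probability of response 1. The SFT update is driven by the fixed expert label, while the RL update weights each response by its reward under the current policy.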
The Unified Objective Framework
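The central theoretical insight appears in Section 4.1. The authors show that the SFT gradient can be reformulated as an RL gradient with a specific reward structure:
\[ \nabla_\theta \mathcal{L}_{SFT}(\theta) = \mathbb{E}_{(x, y_i) \sim D}\, \mathbb{E}_{y \sim \pi_\theta(\cdot | x)}\left[-\frac{I\{y = y_i\}}{\pi_\theta(y | x)}\, \nabla_\theta \log \pi_\theta(y | x)\right] \]
where \(I\{y = y_i\}\) is an indicator function that equals 1 when the sampled output \(y\) exactly matches the ground truth \(y_i\). The term \(-\frac{I\{y = y_i\}}{\pi_\theta(y | x)}\) serves as a proxy reward function: it is nonzero only when the policy's sample matches the expert, and its magnitude scales inversely with the probability, so matching a low-probability expert output carries more weight. The negative sign appears because this is the gradient of a loss to be minimized; read as a reward to be maximized, the weight is \(\frac{I\{y = y_i\}}{\pi_\theta(y | x)}\).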
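The identity can be checked numerically. The sketch below is an illustration under assumed names (`score`, `sft_gradient`, `rl_form_gradient` are hypothetical helpers), using a toy softmax policy over a few whole responses so that the expectation over \(y \sim \pi_\theta\) can be summed exactly instead of sampled.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(logits, y):
    """grad_theta log pi(y) for a softmax policy parameterized by its logits."""
    probs = softmax(logits)
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]

def sft_gradient(logits, expert_y):
    """Gradient of the SFT loss -log pi(expert_y)."""
    return [-g for g in score(logits, expert_y)]

def rl_form_gradient(logits, expert_y):
    """E_{y~pi}[ -1{y == expert_y} / pi(y) * grad log pi(y) ], summed exactly."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, p in enumerate(probs):
        proxy_reward = -(1.0 if y == expert_y else 0.0) / p
        g = score(logits, y)
        for k in range(len(logits)):
            grad[k] += p * proxy_reward * g[k]
    return grad

if __name__ == "__main__":
    logits = [0.3, -1.2, 2.0]  # arbitrary illustrative parameters
    for expert_y in range(3):
        print(sft_gradient(logits, expert_y), rl_form_gradient(logits, expert_y))
```

The two gradients agree because the factor \(\pi_\theta(y | x)\) from the expectation cancels the \(1/\pi_\theta(y | x)\) in the proxy reward, leaving only the expert term.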
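This leads to a unified view: both SFT and RL can be seen as optimizing:
\[ \max_\theta\; \mathbb{E}_{x \sim D}\left[\mathbb{E}_{y \sim \pi_\theta(\cdot | x)}\left[r(x, y)\right] - \beta\, \mathrm{KL}\left(\pi_\theta(\cdot | x)\, \|\, \pi_0(\cdot | x)\right)\right] \]
where \(\pi_0\) is a reference policy (typically the pre-trained model), \(r\) is a reward proxy, and \(\beta\) controls KL regularization to prevent deviation from the base model.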
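As a small worked example, the sketch below evaluates a KL-regularized objective for a single prompt, assuming it takes the common form "expected reward minus \(\beta\) times KL from the reference policy". The function names and the toy distributions are hypothetical and chosen only for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outputs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_objective(policy, reference, reward, beta):
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, reward))
    return expected_reward - beta * kl_divergence(policy, reference)

if __name__ == "__main__":
    reference = [0.5, 0.5]   # stand-in for the pre-trained model pi_0
    reward = [1.0, 0.0]      # proxy reward per output
    for policy in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
        print(policy, regularized_objective(policy, reference, reward, beta=0.1))
```

Moving probability toward the rewarded output increases the first term while the KL penalty grows, so \(\beta\) sets how far the policy may drift from the reference.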
Practical implications of unification:
- Techniques for RL (importance sampling, online rollouts) can address SFT's generalization limitations
- Techniques for SFT (memorization from expert data) can be integrated into RL objectives
- The two stages should be viewed as "mutually reinforcing and interdependent" rather than sequential alternatives
Algorithm-Centric SFT Innovations
The paper surveys three algorithm-centric modifications to standard SFT (Section 3.1):
Entropic Distribution Matching (GEM) (Li et al., 2024c) reformulates SFT as a distribution-matching problem with entropy regularization. Standard SFT can overfit to specific token sequences, reducing output diversity. GEM adds an entropy term that encourages the policy to maintain higher uncertainty over outputs, preventing the model from collapsing to deterministic predictions.
Token Cleaning (Pang et al., 2025) addresses noise in SFT data. Not all tokens contribute equally to learning—some are uninformative or misleading. Token Cleaning estimates each token's contribution to model updates and removes low-value tokens, effectively denoising the supervision signal.
One-Token Rollout (Ming et al., 2025) bridges SFT toward RL by treating each token prediction as a one-step RL trajectory. The ground-truth token serves as a reward signal. This introduces on-policy learning without the complexity of full RL rollouts. Intuitively, rather than just maximizing likelihood of the next token, the method uses a policy-gradient-inspired update that incorporates whether the predicted token matches the expert.
Algorithm-Centric RL Innovations
Section 3.2 covers extensive algorithmic developments in RL for LLMs:
Policy optimization algorithms. Proximal Policy Optimization (PPO) (Schulman et al., 2017) has been the dominant approach, using a learned value critic for advantage estimation. Recent work shifts toward critic-free methods:
- Group Relative Policy Optimization (GRPO) (Shao et al., 2024) replaces the value network with group-relative normalized advantages. Given a group of sampled outputs for a prompt, GRPO computes advantages by normalizing rewards within the group, eliminating the need for a separate critic model.
- Direct REINFORCE-style updates (Li et al., 2023; Ahmadian et al., 2024; Hu, 2025; Xiong et al., 2025a) forego the complex components of PPO (value function, clipping, Generalized Advantage Estimation) for simpler gradient estimators.
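The group-relative normalization can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name is hypothetical, and the use of the population standard deviation and of a small `eps` for numerical safety are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled group to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption in this sketch
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Four sampled answers to one prompt, scored 1 if correct and 0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group with identical rewards yields zero advantage, hence no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Because the baseline is the group mean, no learned value network is needed; the cost is that prompts whose samples all receive the same reward contribute nothing to the update.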
Entropy regularization. A critical challenge in RL is entropy collapse—the policy becoming overly deterministic. Several approaches address this:
- Standard entropy regularization (Cheng et al., 2025; Cui et al., 2025a) adds an entropy bonus to encourage exploration.
- Weighted token entropy (He et al., 2025a; Shrivastava et al., 2025) applies different entropy weights to different tokens based on importance.
- Covariance-based clipping (Cui et al., 2025a) identifies the covariance between action probability and advantage as the key "driver" of entropy changes, proposing selective clipping based on this covariance combined with KL penalties.
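The simplest of these, a plain entropy bonus, can be sketched as follows. The function names and the coefficient value are hypothetical, and the per-token distributions are passed in directly instead of being produced by a model.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(policy_loss, token_distributions, coef=0.01):
    """Subtract coef * mean token entropy, so minimizing the loss favors exploration."""
    mean_entropy = sum(entropy(d) for d in token_distributions) / len(token_distributions)
    return policy_loss - coef * mean_entropy

if __name__ == "__main__":
    collapsed = [[1.0, 0.0], [1.0, 0.0]]   # deterministic policy, zero entropy
    diverse = [[0.5, 0.5], [0.5, 0.5]]     # maximally uncertain over two tokens
    print(loss_with_entropy_bonus(1.0, collapsed, coef=0.1))
    print(loss_with_entropy_bonus(1.0, diverse, coef=0.1))
```

A weighted-token variant would replace the uniform mean with per-token weights; how those weights are chosen is specific to each cited method and is not reproduced here.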
Advantage estimation improvements. Cheng et al. (2025) and Chen et al. (2025c) incorporate entropy information directly into advantage estimation, stabilizing training dynamics by accounting for uncertainty in value predictions.
Data-Centric Innovations
Data-centric SFT (Section 3.1) focuses on selecting or synthesizing better training data:
- Quality over quantity: Zhou et al. (2023) achieve strong alignment performance with only 1,000 high-quality instruction-response pairs, challenging the assumption that SFT requires massive datasets.
- Information gain selection: FisherSFT (Deb et al., 2025b) selects training examples that maximize information gain relative to the current model, achieving sample-efficient learning.
- Domain-specific mixing: Li et al. (2025b) learn domain-specific weights for data mixing, optimizing the balance between different task types to minimize validation loss.
- Automated synthesis: Quan (2025) and Cao et al. (2025a) generate context-driven instruction-response pairs without heavy human annotation, using knowledge-guided synthesis and iterative refinement.
Data-centric RL (Section 3.2) addresses sample efficiency through data selection:
- Rollout selection: Zhang et al. (2025a) and Xu et al. (2025c) show that training on a subset of informative rollouts (high-variance or diverse examples) can achieve strong performance with fewer samples.
- Prompt selection: Zheng et al. (2025) and Qu et al. (2025) select prompts before rollout generation to reduce computational cost while maintaining performance.
- Curriculum learning: Zhang et al. (2025d) and Yao et al. (2025) dynamically select prompts of intermediate difficulty, maximizing learning signal by avoiding too-easy (no learning) or too-hard (no positive reward) examples.
- Distribution-level adaptation: Chen et al. (2025d) and Wang et al. (2025a) prioritize tasks where the model exhibits greatest advantage or lowest visitation, balancing exploration and exploitation at the distribution level.
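The curriculum idea of keeping prompts of intermediate difficulty can be sketched with a pass-rate filter. This is an illustrative assumption about how such a filter might look, not a reproduction of any cited method; the function name, the `low` and `high` thresholds, and the prompt IDs are hypothetical.

```python
def select_intermediate_prompts(pass_rates, low=0.2, high=0.8):
    """Keep prompt IDs whose empirical pass rate lies in [low, high].

    pass_rates maps prompt ID -> fraction of sampled rollouts that were correct.
    """
    return [pid for pid, rate in pass_rates.items() if low <= rate <= high]

if __name__ == "__main__":
    observed = {
        "too_hard": 0.0,   # never solved: no positive reward to learn from
        "useful": 0.5,     # mixed outcomes: informative learning signal
        "too_easy": 1.0,   # always solved: nothing left to learn
    }
    print(select_intermediate_prompts(observed))
```

Prompts at either extreme yield identical rewards across rollouts, which under group-normalized advantages produces no gradient, so filtering them out saves rollout compute.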
Hybrid Training Methods
Section 4.4 surveys methods that combine SFT and RL beyond simple sequential application. Table 1 provides a comprehensive comparison; key methods include:
Prefix Sampling (Huang et al., 2025) uses the prefix of a ground-truth response to generate continuations. SFT loss trains on the prefix partition (ensuring the model can start correctly), while RL loss trains on the newly generated partition (optimizing for reward). This hybrid approach grounds RL exploration in expert-demonstrated starting points.
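The partitioning can be illustrated with loss masks. The sketch below is an assumption about one way to organize the bookkeeping, not the authors' implementation: `split_prefix_rollout` and the `generate` callable are hypothetical, and tokens are plain strings.

```python
def split_prefix_rollout(expert_tokens, prefix_len, generate):
    """Build one training sequence plus masks marking which loss applies where.

    expert_tokens: tokens of the ground-truth response
    prefix_len:    number of expert tokens kept as the prefix
    generate:      callable mapping a prefix to a list of continuation tokens
    """
    prefix = list(expert_tokens[:prefix_len])
    continuation = list(generate(prefix))
    sequence = prefix + continuation
    sft_mask = [1] * len(prefix) + [0] * len(continuation)   # SFT loss on expert prefix
    rl_mask = [0] * len(prefix) + [1] * len(continuation)    # RL loss on generated part
    return sequence, sft_mask, rl_mask

if __name__ == "__main__":
    fake_policy = lambda prefix: ["so", "x", "=", "2"]   # stand-in for model sampling
    print(split_prefix_rollout(["First", ",", "isolate", "x"], 2, fake_policy))
```

Every position receives exactly one of the two losses, so the expert prefix anchors the start of the response while the reward shapes only what the policy generated itself.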
CHORD (Zhang et al., 2025e) combines a GRPO loss with \(\mathcal{L}_{SFT-\phi}\), a weighted SFT loss whose weights \(\phi\) are based on token uncertainty. This aggregates off-policy expert data with on-policy rollouts.
UFT (Unified Fine-Tuning) (Liu et al., 2025b) unifies supervised and reinforcement fine-tuning into a single process. Its objective balances memorization from expert data (early steps, marked with a \(*\) superscript in the paper's notation) with exploration on policy-generated trajectories (later steps).
Interleaved training (Lv et al., 2025) switches between SFT and RL based on online performance. When performance exceeds a threshold, RL is preferred for exploration; when performance drops, SFT provides corrective guidance.
SRL (Supervised Reinforcement Learning) (Deng et al., 2025) decomposes reasoning into intermediate steps and compares online rollouts with expert trajectories, assigning rewards proportional to matching.
Methods Using SFT to Enhance RL
Section 4.2 covers approaches that leverage expert demonstrations within RL:
Offline demonstration + online rollout combination:
- Yan et al. (2025a) introduce an off-policy guided framework augmenting on-policy updates with reasoning traces from demonstrations.
- SRFT (Fu et al., 2025b) integrates supervised and reinforcement objectives in a single-stage framework, avoiding the inefficiency of sequential training.
- BREAD (Zhang et al., 2025f) uses "branched rollouts anchored by expert prefixes": starting rollouts from expert-generated prefixes reduces reliance on large demonstration sets while improving stability.
Objective modifications:
- NFT (Chen et al., 2025a) enables learning from both correct and incorrect outputs under supervision while implicitly optimizing policy performance.
- ReLIFT (Zhu et al., 2025b) interleaves RL with SFT on the hardest questions, acquiring capabilities beyond what pure RL achieves.
Methods Using RL Perspective to Improve SFT
Section 4.3 covers approaches applying RL insights to SFT:
- DFT (Wu et al., 2025) rescales each token's loss by its predicted probability, rectifying "implicit reward bias" in SFT. The intuition is that standard SFT treats all tokens equally, but the RL perspective reveals that low-probability tokens have higher implicit reward; rescaling corrects this bias.
- iw-SFT (Qin and Springenberg, 2025) interprets SFT as optimizing a lower bound on a sparse-reward RL objective, then tightens that bound via importance weights derived from importance sampling theory.
- DPO-inspired SFT (Wang et al., 2024d) minimizes distribution difference between reference and policy models on offline data, applying Direct Preference Optimization insights without explicit preference pairs.
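The probability rescaling described for DFT can be illustrated on a single token. This is a sketch of the idea as summarized above, not the published implementation: the function names are hypothetical, and the rescaling weight is treated as a constant (in an autodiff framework it would be detached from the gradient).

```python
import math

def sft_token_loss(p):
    """Standard SFT loss for a target token with predicted probability p."""
    return -math.log(p)

def rescaled_token_loss(p):
    """SFT loss multiplied by the (constant) predicted probability p."""
    return -p * math.log(p)

def sft_grad_wrt_prob(p):
    """d(-log p)/dp = -1/p: grows without bound as p approaches 0."""
    return -1.0 / p

def rescaled_grad_wrt_prob(p):
    """With the weight p held constant, the gradient becomes p * (-1/p)."""
    return p * sft_grad_wrt_prob(p)

if __name__ == "__main__":
    for p in (0.9, 0.5, 0.01):
        print(p, sft_grad_wrt_prob(p), rescaled_grad_wrt_prob(p))
```

The \(1/p\) factor in the plain SFT gradient plays the role of the inverse-probability implicit reward; multiplying by \(p\) cancels it, so rare target tokens no longer dominate the update.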
Common Practices and Takeaways (Section 6)
The paper distills practical guidance from the surveyed literature:
When expert data is available: SFT is preferred as an initial stage due to simpler implementation and greater stability. The "SFT then RL" pipeline is most common and typically achieves highest reported performance.
When exploration is needed: RL better balances performance gains with exploratory capacity, though it requires careful handling of entropy collapse and reward hacking.
Bridging distribution shift: Importance sampling on failed queries can treat them as informative positive samples for RFT/RL, bridging the gap between offline data and online policy.
Critical observation: "SFT may reduce the model's generalization ability compared with RL"—this is attributed to memorization versus exploration trade-offs.
Design Choices and Justifications
The paper makes several structural choices worth noting:
Scope limited to text-only tasks: The authors acknowledge that many impactful works extend beyond text modalities but choose to focus on text-only tasks for "clearer side-by-side comparison." Multimodal settings introduce additional confounding factors (model selection, fusion techniques) outside the scope.
Focus on online RL: The paper uses RL "exclusively to denote online RL, as it plays a predominant role in the current frontier of LLM alignment." Offline RL methods are not covered in depth.
Benchmark-oriented paper classification: For the application trend analysis (Section 5, Appendix B), papers are classified by dataset mentions rather than self-described categories. A threshold of 5+ keyword mentions is used as a "conservative filter against purely methodological contributions."
Time span 2023–2025: This captures the rapid evolution from early RLHF to current hybrid methods. Counts for 2025 are full-year projections obtained by doubling first-half (H1) counts, a choice motivated by H1 accounting for roughly 48% of annual totals in 2023 and 2024.
4. Key Insights and Innovations
Innovation 1: Mathematical Unification of SFT and RL Through Gradient Equivalence
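The paper's most significant theoretical contribution is demonstrating that SFT can be reformulated as a special case of RL. The gradient equivalence proof (Section 4.1) shows that:
\[ \nabla_\theta \mathcal{L}_{SFT}(\theta) = \mathbb{E}_{(x, y_i) \sim D}\, \mathbb{E}_{y \sim \pi_\theta(\cdot | x)}\left[-\frac{I\{y = y_i\}}{\pi_\theta(y | x)}\, \nabla_\theta \log \pi_\theta(y | x)\right] \]
This is novel because prior work treated SFT and RL as fundamentally different paradigms, one supervised and one reinforcement-based. By showing they share a common optimization structure (reward maximization with KL regularization), the paper: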
- Enables technique transfer: methods for entropy regularization in RL can address overfitting in SFT; importance sampling in RL can address distribution shift in SFT.
- Provides theoretical grounding for hybrid methods: if both optimize related objectives, combining them is theoretically principled rather than ad hoc.
- Recasts "imitation vs. exploration" as a continuum rather than a dichotomy.
This differs from prior surveys that cataloged SFT and RL methods separately without establishing formal connections.
Innovation 2: Comprehensive Taxonomy of Hybrid Methods
Prior to this paper, hybrid SFT-RL methods existed but lacked systematic categorization. The paper's taxonomy (Table 1, Figure 1) organizes hybrid approaches into distinct categories:
By objective modification direction:
- "(SFT →) RL": Modifying RL objective based on SFT insights (e.g., adding demonstration data, DPO-style objectives)
- "(RL →) SFT": Modifying SFT objective based on RL insights (e.g., probability-rescaled loss, importance-weighted SFT)
- "SFT + RL": Joint optimization of both objectives (e.g., weighted combination, interleaved training)
By data source:
- Offline dataset only
- Offline dataset + online rollouts
- Online rollouts only
This taxonomy is significant because it reveals that hybrid methods are not a monolithic category but a design space with multiple axes of variation. Practitioners can now systematically consider: "Should I modify the objective? In which direction? Should I use online data? How should I weight different sources?"
Innovation 3: Empirical Trend Analysis Showing Field-Wide Shift to Hybrid Methods
The paper's quantitative trend analysis (Figure 2, Appendix B) provides empirical evidence of a paradigm shift in the field. Key findings:
Hybrid training dominance: In 2023, SFT dominated at 73.3% with hybrid methods at only 20.0%. By 2024, hybrid methods became most common at 73.8%, "superseding pure SFT (19.1%)." This is not an incremental shift but a fundamental reorientation of the field's training practices.
Shift from API to open-weight models: Reliance on proprietary API-based models declined from 32.2% (2023) to 11.1% (2025), while open-weight model usage doubled from 12.2% to 25.0%. This reflects "improved benchmark coverage and growing concerns around legality, license compliance, and data provenance."
Domain-specific growth patterns: Math research grew nearly 5× from 2023 to 2025 (492 to a projected 2,399). Code-related studies showed the most dramatic surge: +272% from 2023 to 2024, then +84% to 2025, reflecting "rapid maturation of code-centric benchmarks, tools, and evaluation pipelines."
This trend analysis is novel because prior surveys did not systematically quantify these methodological and practical shifts across the literature.
Innovation 4: Domain-Specific Methodological Patterns
The paper identifies distinct methodological patterns across application domains (Section 5), revealing that the "optimal" SFT/RL combination depends on task characteristics:
General QA requires hallucination management and uncertainty quantification. Methods synthesize "I do not know" outputs and use PPO to reduce factual errors.
Mathematical tasks have verifiable answers, enabling outcome-based rewards. GRPO and Critique-DPO are particularly effective because process-level rewards can be approximated from answer correctness.
Agentic tasks require long-horizon planning. Hierarchical RL (rewards at sentence level, policy updates at token level) and curriculum learning (prioritizing intermediate-difficulty tasks) address the multi-step dependency challenge.
Code tasks have executable feedback. Unit tests provide verifiable reward signals, enabling GRPO-based training on human pull requests or iterative refinement rollouts.
This domain-specific synthesis is significant because it provides actionable guidance: practitioners can now select methods based on whether their task has verifiable answers, requires multi-step planning, or benefits from executable feedback—rather than treating SFT/RL selection as domain-agnostic.
Non-Innovation: This Is Not a Methods Paper
It is important to clarify what this paper does not contribute: it does not propose a new algorithm, model architecture, or training procedure. The contribution is organizational and analytical—synthesizing existing literature, establishing theoretical connections, and identifying trends. This is explicitly a survey paper, not a methods paper, which affects how the work should be evaluated.
5. Experimental Analysis
As a survey paper, there are no original experiments with new results. Instead, the "experimental" analysis consists of:
Literature Classification Methodology
Dataset benchmark-oriented search: The authors conducted keyword searches over arXiv preprints (CS categories) from January 2023 to June 2025, using 26 datasets across four domains (Table 4) as domain-specific query keys:
- Agentic: webshop, webarena, mind2web, miniwob++, scienceworld, alfworld, tdw-mat, c-wah, alfred, rlcard
- QA: hotpotqa, strategyqa, triviaqa, pubmedqa, musique, 2wikimultihopqa, qasper
- Math: gsm8k, asdiv, svamp, aime
- Code: swe-bench, humaneval, livecodebench, bird, intercodesql
Classification threshold: A paper is assigned to a domain if it contains ≥5 mentions of any dataset in that domain. Papers can belong to multiple domains (non-exclusive assignment).
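The threshold rule can be sketched as a small keyword counter. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, only a subset of the keywords is listed, matching is case-insensitive on whole alphanumeric tokens, and the rule is read as "at least one dataset in the domain is mentioned five or more times".

```python
import re

# Subset of the domain keywords listed above, for illustration only.
DOMAIN_KEYWORDS = {
    "agentic": ["webshop", "webarena", "alfworld"],
    "qa": ["hotpotqa", "triviaqa"],
    "math": ["gsm8k", "aime"],
    "code": ["swe-bench", "humaneval"],
}

def count_mentions(text, keyword):
    """Count case-insensitive occurrences of keyword not embedded in a longer word."""
    pattern = r"(?<![a-z0-9])" + re.escape(keyword) + r"(?![a-z0-9])"
    return len(re.findall(pattern, text.lower()))

def classify(text, threshold=5):
    """Return every domain with a dataset mentioned at least `threshold` times."""
    domains = []
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(count_mentions(text, kw) >= threshold for kw in keywords):
            domains.append(domain)
    return domains

if __name__ == "__main__":
    paper = "We evaluate on GSM8K. " * 5 + "HumanEval is mentioned once."
    print(classify(paper))
```

Assignment is non-exclusive, so a paper that crosses the threshold in two domains is returned under both.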
Projection methodology for 2025: The first six months of 2023 and 2024 accounted for 47.38% and 49.81% of annual totals respectively. The authors project 2025 full-year counts by doubling H1 2025 observations.
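The projection arithmetic is simple enough to write down. In the sketch below the function name and the H1 count of 100 are hypothetical; doubling corresponds to assuming that H1 holds exactly half of the year's papers, and the second call shows how the estimate shifts if the observed 2023 share is used instead.

```python
def project_full_year(h1_count, h1_share=0.5):
    """Project a full-year count from a first-half count.

    h1_share is the assumed fraction of annual papers that appear in H1;
    the default of 0.5 is equivalent to doubling the H1 count.
    """
    if not 0.0 < h1_share <= 1.0:
        raise ValueError("h1_share must be in (0, 1]")
    return h1_count / h1_share

if __name__ == "__main__":
    h1_2025 = 100                                        # hypothetical H1 count
    print(project_full_year(h1_2025))                    # doubling rule
    print(project_full_year(h1_2025, h1_share=0.4738))   # using the 2023 H1 share
```

Since both observed shares are slightly below 50%, doubling is a mildly conservative estimate of the full-year total.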
Quantitative Trend Results
Publication volume growth (Table 5, Appendix B.4):
| Domain | 2023 | 2024 | 2025 (projected) | Growth 2023→2024 |
|---|---|---|---|---|
| QA | 292 | 652 | 983 | +123% |
| Math | 492 | 1,098 | 2,399 | +123% |
| Agentic | 100 | 179 | 261 | +79% |
| Code | 115 | 428 | 786 | +272% |
The code domain shows the most dramatic acceleration: +272% from 2023 to 2024, then +84% to 2025. Math shows the largest absolute scale with near 5× growth from 2023 to 2025.
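The growth figures follow directly from the counts in the table. The short check below recomputes them; the dictionary simply transcribes the table, and the helper name is hypothetical.

```python
# (2023, 2024, 2025 projected) publication counts, transcribed from the table above.
COUNTS = {
    "QA": (292, 652, 983),
    "Math": (492, 1098, 2399),
    "Agentic": (100, 179, 261),
    "Code": (115, 428, 786),
}

def pct_growth(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old

if __name__ == "__main__":
    for domain, (y2023, y2024, y2025) in COUNTS.items():
        print(
            f"{domain}: 2023->2024 {pct_growth(y2023, y2024):+.0f}%, "
            f"2024->2025 {pct_growth(y2024, y2025):+.0f}%, "
            f"2023->2025 x{y2025 / y2023:.2f}"
        )
```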
Methodology shift (Figure 2, middle panel):
| Year | SFT-only | Both (Hybrid) | RL-only |
|---|---|---|---|
| 2023 | 73.3% | 20.0% | 6.7% |
| 2024 | 19.1% | 73.8% | 7.1% |
| 2025 | 17.6% | 70.6% | 11.8% |
The shift from SFT-dominant (2023) to hybrid-dominant (2024–2025) is stark. By 2024, hybrid methods are 3.9× more common than SFT-only.
Data source shift (Figure 2, right panel):
| Year | Benchmark | Open-weight Model | API Model | Human/Web |
|---|---|---|---|---|
| 2023 | 48.9% | 12.2% | 32.2% | 6.7% |
| 2024 | 54.7% | 17.5% | 19.9% | 7.9% |
| 2025 | 61.1% | 25.0% | 11.1% | 2.8% |
API model usage declined by 66% (32.2% → 11.1%) while open-weight model usage more than doubled (12.2% → 25.0%). Human/web-curated data shrank to 2.8%, reflecting "growing concerns around legality, license compliance, and data provenance."
Assessment of Methodology
Strengths:
- Reproducible methodology: The benchmark-oriented search with explicit threshold (≥5 keyword mentions) is transparent and replicable.
- Conservative classification: The 5-mention threshold is justified by the observation that papers genuinely using a dataset typically "introduce it, report results on it, and provide comparative or analytical discussion."
- Cross-validation of projections: The projection methodology (doubling H1 counts) is grounded in actual 2023–2024 distributions (~48% H1), not arbitrary assumptions.
Limitations (acknowledged by authors in Appendix B.2):
- Approximation bias: "Commonly adopted benchmarks are not fully inclusive of all real-world scenarios or methodologies." Papers using non-standard datasets may be undercounted.
- Keyword sensitivity: Low thresholds may include papers that mention a dataset "only in passing," while high thresholds may exclude "legitimate work that uses a benchmark but references it sparsely."
- Cross-domain double-counting: Papers evaluating across multiple domains are counted in each, potentially inflating totals (though the authors note this is "relatively rare").
- Naming variations: Unless normalized, naming variations may lead to undercounting.
Do the Analyses Support the Claims?
Claim: "Rapid growth across domains" (strongly supported). All four domains show substantial growth from 2023 to 2024 (+79% to +272%), and projected 2025 counts continue to rise in every domain. The math and code domains show particularly dramatic scaling.
Claim: "Growing adoption of integrated SFT–RL training" (strongly supported). The shift from 20% hybrid (2023) to 74% hybrid (2024) is decisive evidence. The authors' statement that hybrid training "becom[es] the most common approach at 73.8%, superseding pure SFT (19.1%)" accurately reflects Figure 2.
Claim: "Continued shift from API-based labeling to open-weight–generated datasets" (strongly supported). The 66% relative decline in API model usage and roughly 2× increase in open-weight usage provide clear evidence. The authors identify the drivers as "improved benchmark coverage and growing concerns around legality, license compliance, and data provenance."
Claim: "SFT can be viewed as special case of RL" (supported through mathematical derivation). The gradient equivalence proof is technically sound, though the practical implications depend on whether techniques actually transfer effectively, a claim that would require empirical validation beyond the survey's scope.
Missing Analyses
- No performance comparison: The survey does not quantify whether hybrid methods actually outperform pure SFT or pure RL on downstream tasks. The trend analysis shows adoption, not effectiveness.
- No cost analysis: The paper does not compare computational costs of different approaches (e.g., SFT vs. RL vs. hybrid), though Appendix C provides rough hardware requirements.
- Limited ablation of survey methodology: While alternative thresholds are mentioned, the paper does not systematically analyze how sensitive the trends are to threshold choice.
- No citation network analysis: The trend analysis counts papers but does not analyze citation patterns, which could reveal which methods are most influential rather than most numerous.
Hardware Requirements Appendix
Appendix C provides rough guidance:
- SFT (full fine-tuning): ~16GB VRAM per billion parameters (FP16), versus ~2GB/B for inference
- SFT (LoRA/QLoRA): 5–20GB on contemporary GPUs depending on model size
- RL training: 15–25GB VRAM per billion parameters for LoRA-based RL, approximately 1.5–3× higher than SFT
- Scale-dependent requirements: 1–3B models on single 80GB GPU; 7–14B models on 2–4 GPUs; 32B+ on 4–8 GPUs
These are explicitly described as "approximate starting points rather than strict requirements."
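As a back-of-the-envelope illustration, the per-billion-parameter figures above can be turned into a rough sizing helper. The function names are hypothetical, only the two figures stated explicitly per billion parameters are used, and the result ignores activation memory, sequence length, and sharding overhead, so it is a starting point in the same spirit as the appendix.

```python
import math

# Rough GB of VRAM per billion parameters, taken from the figures quoted above.
GB_PER_BILLION = {
    "inference_fp16": 2.0,
    "full_sft_fp16": 16.0,
}

def estimate_vram_gb(params_billion, mode):
    """Approximate total VRAM in GB for a model of the given size."""
    return params_billion * GB_PER_BILLION[mode]

def min_gpus(params_billion, mode, gpu_gb=80.0):
    """Smallest number of GPUs whose combined memory covers the estimate."""
    return math.ceil(estimate_vram_gb(params_billion, mode) / gpu_gb)

if __name__ == "__main__":
    for size in (3, 7, 14, 32):
        need = estimate_vram_gb(size, "full_sft_fp16")
        print(f"{size}B full SFT: ~{need:.0f} GB, at least {min_gpus(size, 'full_sft_fp16')} x 80GB GPU(s)")
```

The resulting GPU counts fall inside the scale-dependent ranges listed above, which suggests the two sets of figures are mutually consistent.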
6. Limitations and Trade-offs
Scope Limitations: Text-Only Focus Excludes Multimodal Realities
The paper explicitly restricts its application analysis to "text-only tasks" (Section 5, Appendix B.1), acknowledging that multimodal settings "introduce additional sources of uncertainty in model selection and fusion techniques." This is a significant limitation given that many frontier LLM applications are inherently multimodal—vision-language models for GUI agents (Hu et al., 2025a; Nguyen et al., 2025), speech-augmented systems (Cui et al., 2025c; Yang et al., 2025c), and retrieval-augmented generation across modalities (Abootorabi et al., 2025).
The authors' justification, that text-only enables "clearer side-by-side comparison", is methodologically sound for a survey establishing foundational understanding. However, it leaves open whether the unified SFT-RL framework transfers to multimodal settings where:
- Reward signals may be noisier (e.g., human preferences on image outputs are less consistent than on text)
- The policy gradient formulation may need modification for continuous action spaces (image generation)
- SFT demonstration quality varies more dramatically across modalities
The paper provides additional references in Appendix D but does not analyze whether the SFT-as-special-case-of-RL argument holds when the action space is not discrete tokens.
Focus on Online RL Excludes Offline RL Paradigms¶
The paper uses RL "exclusively to denote online RL" (Section 2), explicitly excluding offline reinforcement learning methods. This choice is justified by the observation that online RL "plays a predominant role in the current frontier of LLM alignment," but it creates a blind spot:
Offline RL methods (e.g., Conservative Q-Learning, Implicit Q-Learning) have shown promise for LLM post-training in scenarios where online interaction is costly or risky. The unified objective framework presented in Section 4.1 could potentially be extended to offline RL settings, but the paper does not explore this direction.
Practical consequence: Practitioners working in constrained environments (limited compute, safety-critical domains where online exploration is infeasible) will not find guidance on whether offline RL fits into the SFT-RL unification or whether the hybrid training methods reviewed generalize to offline settings.
Trend Analysis Methodology Has Boundary Effects¶
The benchmark-oriented paper classification (Appendix B.2) uses a fixed threshold of ≥5 keyword mentions for domain assignment. While the authors justify this as a "conservative filter against purely methodological contributions," several limitations emerge:
Threshold arbitrariness: Table 5 shows that counts decrease consistently as the threshold increases (e.g., QA papers: 1,308 → 558 as threshold goes from >0 to >5 for 2025 projections). The choice of 5 is defensible but not derived from any principled criterion. The authors acknowledge this, stating:
"low thresholds may inflate counts by including papers that mention a dataset only in passing, while high thresholds may exclude legitimate work that uses a benchmark but references it sparsely."
Undercounting of novel datasets: Papers introducing new benchmarks not in the 26-dataset search keys (Table 4) would be missed entirely. This creates a conservative bias toward established methods and against innovative work using non-standard evaluation.
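The threshold and search-key effects described above can be made concrete with a small sketch. The keys in `DOMAIN_KEYS` are illustrative stand-ins rather than the survey's Table 4 list, and `count_mentions`, `assign_domains`, and `sensitivity` are hypothetical helpers, not the authors' pipeline.

```python
import re
from typing import Dict, List

# Illustrative search keys only; the survey's actual 26 keys are in its Table 4.
DOMAIN_KEYS: Dict[str, List[str]] = {
    "math": ["GSM8K"],
    "code": ["HumanEval", "MBPP"],
}


def count_mentions(text: str, keys: List[str]) -> int:
    """Count case-insensitive whole-word mentions of any key in the text."""
    total = 0
    for key in keys:
        pattern = r"\b" + re.escape(key) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def assign_domains(text: str, threshold: int = 5) -> List[str]:
    """Return the domains whose mention count is at least `threshold`."""
    return [d for d, keys in DOMAIN_KEYS.items()
            if count_mentions(text, keys) >= threshold]


def sensitivity(texts: List[str], thresholds: List[int]) -> Dict[int, Dict[str, int]]:
    """For each threshold, count how many papers land in each domain."""
    result: Dict[int, Dict[str, int]] = {}
    for t in thresholds:
        counts = {d: 0 for d in DOMAIN_KEYS}
        for text in texts:
            for d in assign_domains(text, t):
                counts[d] += 1
        result[t] = counts
    return result
```

A paper that evaluates only on a benchmark absent from the key list is assigned to no domain at any threshold, which is the undercounting effect noted above.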
Non-English and non-arXiv work excluded: The arXiv search covers only the CS categories, so it can miss relevant work published in other languages or in other venues (conference proceedings not yet on arXiv, industry technical reports).
Projection uncertainty for 2025: The methodology doubles H1 2025 counts based on ~48% H1 distribution in 2023–2024. However, if 2025 publication patterns differ (e.g., major conferences shifted dates), projections could be systematically biased.
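To show how sensitive the projection is to the assumed H1 share, the following sketch compares plain doubling with dividing by the share. The function names and the example count of 300 are hypothetical. If the true H1 share is s, doubling is off by a relative factor of 2s - 1.

```python
def project_by_doubling(h1_count: int) -> int:
    """Full-year projection that assumes exactly half of all papers appear in H1."""
    return 2 * h1_count


def project_by_share(h1_count: int, h1_share: float) -> float:
    """Full-year projection that divides the H1 count by the assumed H1 share."""
    if not 0.0 < h1_share <= 1.0:
        raise ValueError("h1_share must be in (0, 1]")
    return h1_count / h1_share


def doubling_bias(h1_share: float) -> float:
    """Relative error of doubling when the true H1 share is h1_share.

    Derivation: (2*h1 - h1/s) / (h1/s) = 2*s - 1.
    """
    return 2.0 * h1_share - 1.0


if __name__ == "__main__":
    h1 = 300  # hypothetical H1 paper count, for illustration only
    print(project_by_doubling(h1), project_by_share(h1, 0.48))
    print(doubling_bias(0.48), doubling_bias(0.55))
```

With a 48% share, doubling undershoots by about 4%. A shift of the H1 share to 55% would make the same procedure overshoot by about 10%.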
No Quantitative Performance Comparison¶
The survey catalogs methods and trends but does not provide a comparative performance analysis of SFT, RL, and hybrid approaches. Key questions remain unanswered:
- Do hybrid methods actually outperform pure SFT or pure RL on downstream benchmarks?
- What is the magnitude of improvement from the unified objectives reviewed in Section 4?
- How do computational costs compare across paradigms?
The authors explicitly note this is outside the survey's scope—the contribution is organizational, not empirical. However, practitioners seeking to decide between SFT-only, RL-only, or hybrid training will find no direct head-to-head performance data.
Hardware Requirements Are Rough Heuristics Only¶
Appendix C provides hardware guidance but explicitly states these are "approximate starting points rather than strict requirements." The estimates (e.g., 15–25GB VRAM per billion parameters for LoRA-based RL) come from community-reported guidance (modal.com, verl documentation) rather than systematic benchmarking.
Uncontrolled variables that affect actual requirements:
- Batch size and sequence length (not specified)
- Rollout generation strategy (affects memory for storing trajectories)
- Gradient checkpointing and quantization settings
- Optimizer choice (AdamW vs. alternatives with different state requirements)
The 1.5–3× higher memory for RL versus SFT is a useful rule of thumb but could be significantly off depending on implementation details.
Theoretical Claims Lack Empirical Validation¶
The gradient equivalence proof (Section 4.1) showing SFT as a special case of RL is mathematically sound. However, the practical implications depend on assumptions that are not empirically validated:
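One way to see the equivalence is sketched below for a single demonstration \((x, y_i)\). It assumes a finite response space and \(\pi_\theta(y \mid x) > 0\) for every \(y\). The intermediate steps are a reconstruction for illustration and may differ from the paper's own presentation.

\[
\begin{aligned}
\nabla_\theta \mathcal{L}_{SFT}(\theta)
&= -\nabla_\theta \log \pi_\theta(y_i \mid x) \\
&= -\sum_{y} I\{y = y_i\}\, \nabla_\theta \log \pi_\theta(y \mid x) \\
&= -\sum_{y} \pi_\theta(y \mid x)\, \frac{I\{y = y_i\}}{\pi_\theta(y \mid x)}\, \nabla_\theta \log \pi_\theta(y \mid x) \\
&= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ -\frac{I\{y = y_i\}}{\pi_\theta(y \mid x)}\, \nabla_\theta \log \pi_\theta(y \mid x) \right].
\end{aligned}
\]

The last line has the score-function (policy gradient) form: an expectation over samples from the current policy of a scalar weight times \(\nabla_\theta \log \pi_\theta(y \mid x)\). Minimizing \(\mathcal{L}_{SFT}\) by gradient descent therefore corresponds to gradient ascent on an expected reward of \(I\{y = y_i\} / \pi_\theta(y \mid x)\).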
Assumption: Techniques transfer effectively. The paper claims that "the tricks that work for one thus have potential to be applied to the other." This is plausible but unverified. For example:
- Does entropy regularization from RL actually mitigate SFT overfitting in practice?
- Does importance sampling address distribution shift in SFT as effectively as the theory suggests?
Assumption: The indicator reward is a useful lens. The reformulation shows that SFT implicitly uses \(-\frac{I\{y = y_i\}}{\pi_\theta(y | x)}\) as a reward. But this reward is extremely sparse (nonzero only for exact matches) and high-variance (scaled by inverse probability). Whether this perspective leads to better algorithms remains an open question.
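The sparsity and variance of this weight can be checked exactly on a toy problem. The sketch below assumes a single-step softmax policy over a handful of discrete responses, which is a simplification of token-level LLM training. All function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def grad_log_prob(probs: List[float], y: int) -> List[float]:
    """Gradient of log pi(y) w.r.t. the logits of a softmax policy: onehot(y) - probs."""
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]


def sft_gradient(probs: List[float], target: int) -> List[float]:
    """Gradient of the SFT loss -log pi(target) w.r.t. the logits."""
    return [-g for g in grad_log_prob(probs, target)]


def policy_gradient_form(probs: List[float], target: int) -> List[float]:
    """Exact expectation over y ~ pi of w(y) * grad log pi(y), with w(y) = -I{y=target}/pi(y)."""
    total = [0.0] * len(probs)
    for y, p_y in enumerate(probs):
        weight = -(1.0 if y == target else 0.0) / p_y
        g = grad_log_prob(probs, y)
        total = [t + p_y * weight * gk for t, gk in zip(total, g)]
    return total


def weight_variance(probs: List[float], target: int) -> float:
    """Variance of w(y) under y ~ pi; analytically 1/pi(target) - 1."""
    weights = [-(1.0 if y == target else 0.0) / p for y, p in enumerate(probs)]
    mean = sum(p * w for p, w in zip(probs, weights))
    second_moment = sum(p * w * w for p, w in zip(probs, weights))
    return second_moment - mean ** 2
```

The two gradient functions agree up to floating-point error. The variance grows as the policy assigns less probability to the demonstration, which is the high-variance behavior described above.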
Domain-Specific Recommendations Lack Ablation¶
Section 5 identifies methodological patterns for each domain (QA uses hallucination management; Math uses outcome-based rewards; Agents use hierarchical RL; Code uses unit tests). However, these are observational correlations rather than causal recommendations:
- The paper does not test whether these patterns are optimal versus alternatives
- No ablation studies show what happens when, e.g., a math task uses hierarchical RL instead of outcome rewards
- The domain characteristics (Table 3) are qualitative and not rigorously linked to method effectiveness
Consequence: Practitioners receive guidance on what others have done, but not rigorous evidence on what they should do.
Open Problems Explicitly Acknowledged¶
The paper identifies two open challenges in Section 7:
- Sample- and compute-efficient methodologies: Current pipelines require "substantial computational resources, large volumes of high-quality data, and extensive rollout generation."
- SFT and RL under sparse or indirect reward signals: Many real-world tasks lack well-defined rewards, with user feedback that is "sparse, inconsistent, or costly."
The authors call for future work on these fronts but do not provide concrete roadmaps. The second challenge—learning from implicit supervision such as "self-evaluation signals or user-churn behaviors"—is particularly underexplored in the surveyed literature.
7. Implications and Future Directions¶
How This Work Changes the Landscape¶
This survey establishes a theoretical foundation for the emerging hybrid training paradigm. Prior to this work, the shift toward hybrid SFT-RL methods (from 20% in 2023 to 74% in 2024) was observable but not understood. By showing that SFT and RL optimize related objectives—reward maximization with KL regularization—the paper provides the conceptual scaffolding to explain why hybrid methods work:
- Not ad hoc combination: Hybrid training is not a "try both and see" heuristic but a principled integration of complementary mechanisms: imitation for stability, exploration for generalization.
- Transfer of techniques becomes meaningful: If entropy regularization prevents collapse in RL, applying similar concepts to SFT (as in GEM's entropy regularization) is not coincidental but theoretically motivated.
- Framework for future algorithm design: New methods can be designed to explicitly balance the imitation signal (SFT's indicator reward) and exploration signal (RL's learned reward) rather than defaulting to sequential pipelines.
Practical impact: Organizations designing post-training pipelines can now reason about the SFT-RL trade-off in terms of explicit design choices (reward function, KL regularization strength, data mixing ratio) rather than treating the two paradigms as black boxes.
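To make two of these design choices concrete, the sketch below evaluates a KL-regularized reward objective on a toy categorical policy. The names `kl_regularized_objective` and `beta` are assumptions for this example, and a real pipeline would estimate these quantities from sampled rollouts rather than enumerate them.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for categorical distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_regularized_objective(policy: Sequence[float],
                             reference: Sequence[float],
                             rewards: Sequence[float],
                             beta: float) -> float:
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


if __name__ == "__main__":
    reference = [0.5, 0.3, 0.2]   # e.g. a frozen SFT model (toy numbers)
    rewards = [0.0, 0.0, 1.0]     # only the third response is rewarded
    greedy = [0.0, 0.0, 1.0]      # policy that always picks the rewarded response
    for beta in (0.1, 1.0):
        print(beta,
              kl_regularized_objective(reference, reference, rewards, beta),
              kl_regularized_objective(greedy, reference, rewards, beta))
```

With a small `beta` the greedy policy scores higher than staying at the reference. With a large `beta` the KL penalty outweighs the reward gain, which is the trade-off the regularization strength controls.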
Follow-Up Research This Work Enables¶
1. Empirical validation of technique transfer. The unified framework predicts that techniques should transfer between SFT and RL. A natural follow-up is systematic empirical testing:
- Does importance sampling (effective in RL) improve SFT generalization on reasoning tasks?
- Does entropy regularization (from RL) measurably reduce SFT overfitting?
- Does curriculum learning (from RL) improve SFT sample efficiency?
The survey provides the theoretical justification; experimental work is needed to confirm or refute the practical effectiveness.
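As one example of what such a transfer experiment would manipulate, here is a minimal sketch of a token-level SFT loss with a generic entropy bonus. This is an illustration of the general idea, not the formulation used by GEM or any other surveyed method, and `alpha` is a hypothetical coefficient.

```python
import math
from typing import Sequence


def token_cross_entropy(probs: Sequence[float], target: int) -> float:
    """Standard SFT loss for one token: -log of the probability of the target."""
    return -math.log(probs[target])


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_sft_loss(probs: Sequence[float], target: int, alpha: float) -> float:
    """Cross-entropy minus alpha times the entropy of the predicted distribution."""
    return token_cross_entropy(probs, target) - alpha * entropy(probs)


if __name__ == "__main__":
    confident = [0.98, 0.01, 0.01]
    softer = [0.90, 0.05, 0.05]
    for alpha in (0.0, 0.5):
        print(alpha,
              entropy_regularized_sft_loss(confident, 0, alpha),
              entropy_regularized_sft_loss(softer, 0, alpha))
```

With `alpha = 0` the more confident prediction has the lower loss. With `alpha = 0.5` the softer prediction is preferred, so the bonus discourages collapsing onto the demonstration token. Whether this reduces overfitting in practice is exactly the open empirical question.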
2. Optimal hybrid training schedules. The paper surveys methods that interleave SFT and RL (Lv et al., 2025), combine weighted objectives (Zhang et al., 2025e), or use prefix-based hybrid training (Huang et al., 2025). But no work systematically compares these approaches:
- When should SFT come before RL versus interleaved?
- What is the optimal weighting between \(\mathcal{L}_{SFT}\) and \(\mathcal{L}_{RL}\)?
- Should the schedule be static or dynamically adapted based on performance?
Future work could develop a "training schedule theory" analogous to learning rate schedules, but for the SFT-RL balance.
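A minimal sketch of what such a schedule could look like, assuming a convex combination of the two losses with a linearly decaying SFT weight. The names `sft_weight`, `hybrid_loss`, and `lam` are hypothetical, and none of this is taken from the surveyed methods.

```python
def sft_weight(step: int, total_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Linearly interpolate the SFT weight from `start` to `end` over training."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def hybrid_loss(sft_loss: float, rl_loss: float, lam: float) -> float:
    """Convex combination lam * L_SFT + (1 - lam) * L_RL."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * sft_loss + (1.0 - lam) * rl_loss


if __name__ == "__main__":
    total = 11
    for step in (0, 5, 10):
        lam = sft_weight(step, total)
        print(step, lam, hybrid_loss(2.0, 4.0, lam))
```

Setting `start == end` gives a static weighting, and a step function in place of the linear ramp recovers the sequential SFT-then-RL pipeline. A dynamically adapted schedule would replace `sft_weight` with a function of validation metrics.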
3. Extension to offline RL and semi-supervised settings. The paper's focus on online RL leaves offline RL unexplored. Extending the unified framework to offline RL settings, where interaction is limited but demonstration data is abundant, would expand applicability to:
- Medical LLMs where online exploration is unethical
- Legal or financial domains where errors are costly
- Low-resource settings without compute for rollouts
4. Implicit reward discovery. Section 7 calls for "leveraging additional natural but underexplored sources of implicit supervision." Concrete directions include:
- User behavior signals (churn, engagement time, reformulation patterns) as implicit feedback
- Self-evaluation signals (the model's confidence in its own outputs)
- Environmental feedback (execution results, tool outputs) beyond explicit correctness
The unified framework suggests these could be incorporated as reward signals within the RL objective, but the specifics remain unexplored.
5. Domain-specific optimal method selection. The paper identifies domain-specific patterns but does not provide prescriptive guidance. Follow-up work could develop:
- Decision frameworks matching task characteristics to optimal training paradigms
- Quantitative models predicting when SFT, RL, or hybrid methods will perform best based on data characteristics (quality, quantity), task properties (verifiability, horizon length), and resource constraints
- Benchmark suites specifically designed to evaluate SFT vs. RL vs. hybrid methods
Practical Applications and Downstream Use Cases¶
For practitioners designing post-training pipelines:
The survey provides actionable guidance for three common scenarios:
- High-quality expert data available, limited compute: Start with SFT. The paper notes SFT is "preferred as an initial training stage" due to "simpler implementation and greater stability." Methods like FisherSFT (Deb et al., 2025b) can maximize sample efficiency.
- Verifiable task with clear reward signal (math, code): Consider RL-first or hybrid approaches. Outcome-based rewards enable GRPO (Shao et al., 2024) or DPO variants without complex reward modeling.
- Agentic tasks with long horizons: Hybrid approaches combining hierarchical RL (sentence-level rewards, token-level updates) with demonstration anchoring (prefix sampling) address the multi-step dependency challenge.
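For the verifiable-task scenario, the sketch below shows an exact-match outcome reward and the group-relative advantage normalization at the core of GRPO. It is a simplified illustration. The clipped policy-ratio objective and KL term of the full algorithm are omitted, and the choice of population standard deviation and the `eps` value are assumptions that vary between implementations.

```python
import statistics
from typing import List


def outcome_reward(predicted: str, reference: str) -> float:
    """1.0 if the final answer matches the reference after trimming whitespace, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of samples: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to the same prompt; the reference answer is "42".
    answers = ["42", " 41", "42 ", "7"]
    rewards = [outcome_reward(a, "42") for a in answers]
    print(rewards, group_relative_advantages(rewards))
```

Because the advantage is computed relative to the other samples for the same prompt, no learned reward model or value network is needed. A group in which every sample receives the same reward yields zero advantages and therefore no learning signal.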
Hardware planning: The rough guidance in Appendix C suggests practitioners should allocate:
- 5–20GB VRAM for LoRA-based SFT on contemporary models
- 1.5–3× more memory for RL (15–25GB/B parameters)
- Multi-GPU setups for 7B+ models in RL training
For researchers:
The survey's taxonomy (Figure 1, Table 1) provides a structured literature review starting point. The open problems in Section 7—sample efficiency and sparse-reward learning—offer concrete research directions with clear motivation.
When to Prefer Specific Approaches¶
Based on the synthesized literature:
Prefer SFT when:
- Expert demonstrations are high-quality and representative of target distribution
- Task requires memorization of specific knowledge patterns (factual QA, domain expertise)
- Compute budget is limited and stability is critical
- The model is "unfamiliar with the downstream task or cannot efficiently generate sufficient positive samples" (Section 6)
Prefer RL when:
- Task has clear, verifiable reward signals (math, code execution, game environments)
- Exploration is needed to discover behaviors absent from demonstrations
- Generalization to novel problem structures is important
- The training data contains preference information rather than demonstrations
Prefer hybrid SFT-RL when:
- Expert data is available but may not cover all edge cases
- Task requires both accuracy (from imitation) and robustness (from exploration)
- The SFT model shows generalization gaps that RL could address
- Resources permit both demonstration-based and rollout-based training
The unified framework suggests the optimal approach is rarely pure SFT or pure RL, but rather a carefully designed combination—matching the empirical trend toward hybrid dominance identified in the survey.