A Survey of On-Policy Distillation for Large Language Models

Pitch

This survey tackles exposure bias—the critical flaw in conventional LLM distillation where students trained on static teacher data fail to recover from their own errors during inference. By letting students generate their own trajectories and receive teacher feedback on these self-generated outputs, On-Policy Distillation reduces compounding error from quadratic to linear scaling, enabling robust capability transfer to smaller models. The work unifies this fragmented field under a rigorous mathematical framework, organizing methods across three orthogonal dimensions and charting the path toward more reliable, deployable student models.

1. Executive Summary

This survey provides the first comprehensive treatment of On-Policy Distillation (OPD) for Large Language Models, unifying a fragmented literature spanning divergence minimization, reinforcement learning, and self-play under a single mathematical framework. The core contribution is a theoretically grounded taxonomy organized along three orthogonal dimensions: feedback signal type (logit-based, outcome-based, self-play), teacher access level (white-box, black-box, teacher-free), and loss granularity (token-level, sequence-level, hybrid). The taxonomy reveals how methods like GKD, MiniLLM, and DistiLLM instantiate the same underlying \(f\)-divergence objective with different parameterizations. The survey argues that shifting from off-policy to on-policy training distributions mitigates exposure bias, tightening the compounding-error bound from \(O(\epsilon T^2)\) to \(O(\epsilon T)\), and reports that adaptive divergence selection consistently outperforms fixed divergence choices across reasoning and generation tasks.

2. Context and Motivation

The Core Problem: Exposure Bias in Off-Policy Distillation

The fundamental problem this survey addresses is exposure bias—a train-test distribution mismatch that plagues conventional knowledge distillation for autoregressive language models. In standard off-policy distillation, the student model trains on static, pre-generated teacher outputs and learns to predict the teacher's token distributions conditioned on prefixes from this fixed dataset. At inference time, however, the student must generate autoregressively, conditioning each token on its own previously generated (potentially erroneous) outputs. Because the student never encountered these self-induced error states during training, it lacks supervisory signals for recovery, causing prediction errors to compound over sequence length.

The survey formalizes this through the lens of imitation learning theory. Drawing on the DAgger algorithm analysis from Ross et al. (2011), the survey shows that if a student policy \(\pi_\theta\) mimics a teacher policy \(\pi^*\) with per-step error bounded by \(\epsilon\) under the training distribution, the expected total discrepancy over a trajectory of length \(T\) under the student's own distribution scales quadratically as \(O(\epsilon T^2)\). For modern LLMs generating sequences spanning thousands of tokens, this compounding error is severe—a single suboptimal token shifts the prefix out-of-distribution, making subsequent errors more likely.

Why This Problem Matters

The survey identifies several practical implications that make exposure bias economically significant:

Deployment costs and capability transfer. As frontier models scale to hundreds of billions of parameters (e.g., DeepSeek-R1 at 671B parameters, GPT-4), inference costs become prohibitive for most deployment scenarios. Knowledge distillation has accordingly grown from a compression technique into a "general-purpose capability transfer engine." The release of DeepSeek-R1 demonstrated successful transfer of chain-of-thought reasoning from a 671B mixture-of-experts teacher into dense students ranging from 1.5B to 70B parameters. However, if distilled students degrade on tasks requiring sustained multi-step generation due to exposure bias, the economic value of distillation is undermined.

Conflicting evidence in prior work. The survey highlights a genuine contradiction in the literature. On one side, several works show LLMs can use test-time computation productively through self-correction and verifier-guided sampling. On the other, studies like Gudibande et al. (2023) and Huang et al. (2023) found that "large language models cannot self-correct reasoning yet" and that prompting off-the-shelf LLMs to revise their outputs is "largely ineffective." The survey reconciles these findings by showing that the effectiveness depends critically on whether training exposes the student to its own error states—a factor prior work did not systematically control.

Theoretical gap. While pretraining scaling laws (Hoffmann et al., 2022) provide principled compute allocation guidance, and recent work has begun addressing distillation scaling laws (Busbridge et al., 2025), no equivalent theoretical framework existed for on-policy distillation. The generation cost of student rollouts introduces an additional compute axis that fundamentally changes the optimization landscape, leaving practitioners to rely on trial-and-error for budget allocation.

Where Existing Approaches Fall Short

The survey identifies specific limitations in prior work along three axes:

Off-policy distillation assumes error-free prefixes. Classical token-level KD (Hinton et al., 2015) and its LLM extensions minimize divergence between teacher and student distributions over prefixes from a fixed dataset. As Section 2.2 formalizes, the loss \(\mathcal{L}_{\text{Token-KD}} = \mathbb{E}_{x,y \sim D}[\sum_{t=1}^{|y|} D_{KL}(p_T(\cdot|x, y_{<t}) \| p_\theta(\cdot|x, y_{<t}))]\) assumes \(y_{<t}\) comes from the data distribution, ignoring the sequential, compounding nature of generation.

Sequence-level KD remains off-policy. Kim & Rush (2016) proposed sequence-level distillation that matches joint probability distributions over entire sequences, approximating the teacher's distribution as a Dirac delta at the beam-search output. However, the survey notes that "Sequence-Level KD remains an off-policy algorithm. The student still trains via teacher-forcing on static trajectories."

White-box assumptions limit applicability. Many proposed methods require full access to teacher logits, but frontier models like GPT-4 and Claude expose only API-level access returning generated text. The survey notes that "existing surveys of LLM distillation predominantly organize the field around the classical compression framing, treating off-policy and on-policy approaches as interchangeable variants rather than as fundamentally distinct paradigms."

Fragmented literature across communities. Methods from the knowledge distillation community, the RLHF community, and the imitation learning community address the same underlying problem with "different formalisms, different evaluation protocols, and different terminology." No prior work provided a unified mathematical treatment connecting these approaches.

How This Survey Positions Itself

The survey explicitly frames itself as filling a conceptual gap rather than proposing new methods. It organizes OPD methods through a three-dimensional taxonomy (Section 3), provides a unified \(f\)-divergence framework (Section 2.5), and bridges white-box and black-box regimes (Sections 4-5). The survey draws direct parallels to DAgger from imitation learning, establishing that "shifting from off-policy to on-policy training distributions... is a consistently impactful design choice, yielding meaningful accuracy gains across mathematical reasoning, code generation, and instruction following."

3. Technical Approach

3.1 Reader Orientation

This is a survey paper that synthesizes and unifies the literature on On-Policy Distillation for LLMs, organizing disparate methods under a single mathematical framework and providing a taxonomy that reveals their relationships. The core intellectual contribution is conceptual: showing that methods like GKD, MiniLLM, and DistiLLM are not heuristics but systematic instantiations of a unified objective parameterized by sampling policy, divergence function, and argument ordering.

3.2 Big-Picture Architecture (Diagram in Words)

The survey constructs a conceptual framework with four major components:

The Unified Objective (Equation 14) — a generalized formulation \(\mathcal{L}_{\text{OPD}}(\theta) = \mathbb{E}_{y \sim \pi_{\text{mix}}}[\sum_{t=1}^{|y|} D_f(p_T(\cdot|x, y_{<t}), p_\theta(\cdot|x, y_{<t}))]\) that decouples the sampling distribution from the divergence metric, allowing any method to be described by specifying \(\pi_{\text{mix}}\), \(f\), and argument ordering.
The Three-Dimensional Taxonomy (Figure 2) — an organizational scheme with orthogonal axes: (a) feedback signal (logit-based, outcome-based, self-play), (b) teacher access (white-box, black-box, self-distillation), and (c) granularity (token-level, sequence-level, hybrid).
The f-Divergence Family (Section 2.4) — a mathematical framework for understanding distribution matching, where Forward KL is mode-covering (zero-avoiding), Reverse KL is mode-seeking (zero-forcing), and Jensen-Shannon Divergence provides bounded, symmetric gradients.
The Exposure Bias Analysis (Section 2.3) — a formalization of the train-test mismatch through MDP framing, establishing that on-policy sampling reduces compounding error from \(O(\epsilon T^2)\) to \(O(\epsilon T)\).

3.3 Roadmap for the Deep Dive

First, the mathematical foundations (Section 2.1-2.3): knowledge distillation fundamentals, token-level vs. sequence-level formulations, and the exposure bias formalization, since these establish the problem OPD solves.
Second, the \(f\)-divergence framework (Section 2.4) and unified objective (Section 2.5), which provide the analytical language for comparing methods.
Third, the taxonomy dimensions (Section 3), which organize the methodological landscape.
Fourth, white-box token-level methods (Section 4.1), the most active research direction, tracing the evolution from fixed to adaptive divergences.
Fifth, sequence-level and hybrid methods (Sections 4.2-4.3), which address the variance-bias tradeoff.
Sixth, black-box and self-distillation methods (Section 5), which handle restricted teacher access.

3.4 Detailed Technical Breakdown

Foundational Framing. The survey begins by establishing that autoregressive LLM generation can be cast as a sequential decision problem. The state space \(\mathcal{S}\) comprises all token prefixes \(s_t = (x, y_{<t})\), the action space \(\mathcal{A}\) is the vocabulary \(\mathcal{V}\), and transitions are deterministic: action \(y_t\) in state \(s_t\) yields \(s_{t+1} = (x, y_{<t}, y_t)\).
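The MDP framing can be made concrete with a minimal Python sketch. The names `State` and `step` are illustrative and do not come from the survey; tokens are assumed to be integer ids.

```python
from typing import NamedTuple, Tuple

class State(NamedTuple):
    """A state s_t = (x, y_<t): the prompt plus the tokens generated so far."""
    prompt: Tuple[int, ...]
    prefix: Tuple[int, ...]

def step(state: State, token: int) -> State:
    """Deterministic transition: taking action y_t appends it to the prefix."""
    return State(state.prompt, state.prefix + (token,))

s0 = State(prompt=(101, 7), prefix=())
s2 = step(step(s0, 42), 13)
print(s2.prefix)  # (42, 13)
```

Because transitions are deterministic, the only randomness in a trajectory comes from the policy that chooses each token, which is why the choice of sampling policy matters so much in what follows.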

Knowledge Distillation Fundamentals

Classical knowledge distillation (Hinton et al., 2015) transfers "dark knowledge"—the relational structure embedded in softened probability distributions—from a teacher network \(T\) to a student \(S\). For a vocabulary of size \(|\mathcal{V}|\), logits \(z \in \mathbb{R}^{|\mathcal{V}|}\) are softened using temperature \(\tau > 0\):

\[p(y|x; \tau) = \frac{\exp(z_y/\tau)}{\sum_{y' \in \mathcal{V}} \exp(z_{y'}/\tau)}\]

The distillation loss minimizes KL divergence between teacher and student softened distributions:

\[\mathcal{L}_{\text{KD}} = \tau^2 D_{KL}(p_T(\cdot|x; \tau) \| p_\theta(\cdot|x; \tau))\]

The \(\tau^2\) factor compensates for the \(1/\tau^2\) scaling of soft-target gradients, keeping them commensurate with hard-label cross-entropy as \(\tau\) varies. The survey derives that as \(\tau \to \infty\) (with zero-mean logits), this degenerates to minimizing Mean Squared Error between raw logits: \(\partial \mathcal{L}_{\text{KD}} / \partial z^S_i \approx \frac{1}{|\mathcal{V}|}(z^S_i - z^T_i)\).
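A small pure-Python sketch of the temperature-softened loss may help here. It computes \(\tau^2 D_{KL}\) for hypothetical logits and compares the large-\(\tau\) value against a squared-error expression on mean-centered logits. The helper names and the example logits are assumptions made for illustration, and the limit expression is a short derivation rather than a formula quoted from the survey.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-softened softmax; larger tau flattens the distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(teacher_logits, student_logits, tau):
    """tau^2 * KL(teacher || student) on temperature-softened distributions."""
    return tau ** 2 * kl(softmax(teacher_logits, tau), softmax(student_logits, tau))

def centered_mse_limit(teacher_logits, student_logits):
    """Large-tau limit: squared distance between mean-centered logits, over 2|V|."""
    n = len(teacher_logits)
    mt = sum(teacher_logits) / n
    ms = sum(student_logits) / n
    return sum(((s - ms) - (t - mt)) ** 2
               for s, t in zip(student_logits, teacher_logits)) / (2 * n)

teacher = [4.0, 1.0, 0.5]  # hypothetical teacher logits
student = [2.5, 1.5, 0.2]  # hypothetical student logits
for tau in (1.0, 4.0, 1000.0):
    print(tau, kd_loss(teacher, student, tau))
print("limit", centered_mse_limit(teacher, student))
```

At \(\tau = 1000\) the loss is numerically close to the squared-error expression, which is the sense in which high-temperature distillation behaves like logit matching.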

Token-Level vs. Sequence-Level Distillation

Token-level distillation applies classical KD to autoregressive generation by computing loss at each position independently:

\[\mathcal{L}_{\text{Token-KD}} = \mathbb{E}_{x,y \sim D}\left[\sum_{t=1}^{|y|} D_{KL}(p_T(\cdot|x, y_{<t}) \| p_\theta(\cdot|x, y_{<t}))\right]\]

While computationally tractable, this misaligns with the global generative objective because it optimizes next-token accuracy while assuming error-free history.

Sequence-level distillation (Kim & Rush, 2016) aims to match joint probability distributions:

\[\mathcal{L}_{\text{Seq-KD}} = D_{KL}(P_T(y|x) \| P_\theta(y|x)) = \sum_{y \in \mathcal{Y}} P_T(y|x) \log\left(\frac{P_T(y|x)}{P_\theta(y|x)}\right)\]

Because the sequence space grows exponentially (\(|\mathcal{Y}| = |\mathcal{V}|^T\)), exact computation is intractable. Kim & Rush (2016) approximated the teacher distribution as a Dirac delta at the beam-search output \(\hat{y}\), reducing the loss to negative log-likelihood on the teacher's highest-probability sequence.
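The reduction can be written out in one step. Under the point-mass approximation (with the convention \(0 \log 0 = 0\)), every term except \(y = \hat{y}\) vanishes, leaving only the student's log-likelihood of the beam-search output. This is a short derivation from the definitions above, not an equation reproduced from the survey:

\[\mathcal{L}_{\text{Seq-KD}} \approx \sum_{y \in \mathcal{Y}} \mathbb{1}[y = \hat{y}] \log\left(\frac{\mathbb{1}[y = \hat{y}]}{P_\theta(y|x)}\right) = -\log P_\theta(\hat{y}|x) = -\sum_{t=1}^{|\hat{y}|} \log p_\theta(\hat{y}_t|x, \hat{y}_{<t})\]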

The Exposure Bias Formalization

The survey formalizes the core problem through the MDP framework. During off-policy training, the student encounters states governed by the data distribution \(d_D(s)\). The training objective is:

\[\mathcal{L}_{\text{train}} = \mathbb{E}_{s \sim d_D}[D_{KL}(\pi^*(\cdot|s) \| \pi_\theta(\cdot|s))]\]

At inference, the student acts according to its own learned policy \(\pi_\theta\), inducing a different state visitation distribution \(d_{\pi_\theta}(s)\). Exposure bias arises from \(d_D(s) \neq d_{\pi_\theta}(s)\).

The survey quantifies this using DAgger bounds: if the student mimics the teacher with per-step error \(\epsilon\) under the training distribution, the expected total error under the student's own distribution scales as:

\[\mathbb{E}_{s \sim d_{\pi_\theta}}\left[\sum_{t=1}^T \mathcal{L}(s_t)\right] \leq O(\epsilon T^2)\]

For modern LLMs generating sequences of thousands of tokens, this quadratic compounding is catastrophic. On-policy distillation resolves this by changing the training expectation from \(\mathbb{E}_{s \sim d_D}\) to \(\mathbb{E}_{s \sim d_{\pi_\theta}}\), reducing the bound to \(O(\epsilon T)\).
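The gap between the two bounds can be illustrated with a toy model. This is an assumption-laden sketch, not the survey's analysis: each on-distribution step errs with probability `eps`; without recovery, the first error pushes the policy off-distribution for the rest of the sequence, whereas with recovery each step errs independently.

```python
def cost_without_recovery(eps, horizon):
    """Expected mistakes when the first error is never recovered from.

    Toy model: each on-distribution step errs with probability eps; after the
    first error the policy is off-distribution and errs on every later step.
    """
    p_on_track = 1.0
    total = 0.0
    for _ in range(horizon):
        p_on_track *= 1.0 - eps    # probability of no error up to and including this step
        total += 1.0 - p_on_track  # this step is a mistake once the policy is off track
    return total

def cost_with_recovery(eps, horizon):
    """Expected mistakes when every visited state has per-step error eps."""
    return eps * horizon

eps = 1e-4
for horizon in (100, 1000, 4000):
    off = cost_without_recovery(eps, horizon)
    on = cost_with_recovery(eps, horizon)
    print(horizon, round(off, 3), round(on, 3), round(off / on, 1))
```

While `eps * horizon` stays small, the first cost grows roughly like `eps * horizon**2 / 2` and the second like `eps * horizon`, so their ratio widens with sequence length.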

A critical nuance. The survey notes that the DAgger bound assumes an "interactive expert providing optimal actions in any state." In white-box OPD, if the student generates a severely out-of-distribution prefix, "the teacher's conditional distribution may itself become poorly calibrated. Forcing the student to match this noisy distribution violates the core assumption... potentially destabilizing training." This explains why adaptive divergence methods are critical.

The f-Divergence Framework

To parameterize distribution matching, the survey introduces the \(f\)-divergence family. Given distributions \(P\) and \(Q\):

\[D_f(P \| Q) = \int_{\mathcal{X}} Q(x) f\left(\frac{P(x)}{Q(x)}\right) dx = \mathbb{E}_{x \sim Q}\left[f\left(\frac{P(x)}{Q(x)}\right)\right]\]

Different choices of \(f\) yield different properties:

Forward KL (\(f(u) = u \log u\)): \(D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}[\log(P(x)/Q(x))]\). The gradient forces the student \(Q\) to place mass everywhere the teacher \(P\) has mass. If \(P(x) > 0\) but \(Q(x) \approx 0\), the penalty diverges. This produces mode-covering (zero-avoiding) behavior—"the student covers both modes but places mass in the inter-mode 'hallucination zone'" (Figure 1).

Reverse KL (\(f(u) = -\log u\)): \(D_{KL}(Q \| P) = \mathbb{E}_{x \sim Q}[\log(Q(x)/P(x))]\). If \(Q(x) > 0\) where \(P(x) \approx 0\), the penalty diverges. The student assigns probability only to regions supported by the teacher, producing mode-seeking (zero-forcing) behavior—collapsing onto one major mode while ignoring minor ones.

Jensen-Shannon Divergence: Symmetric and bounded in \([0, \log 2]\), providing a "stable gradient field that balances mode-seeking and mode-covering behavior."

Figure 1 provides geometric intuition: for a bimodal teacher distribution, Forward KL produces a student that covers both modes but hallucinates in between, while Reverse KL produces a student that concentrates on one mode, dropping the other.
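The mode-covering versus mode-seeking contrast can be checked numerically on a small discrete example. The five-bin distributions below are hypothetical, chosen to mimic a bimodal teacher, a student that spreads its mass, and a student that commits to one mode.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Bimodal teacher over five bins: two modes at the ends, little mass between them.
teacher = [0.45, 0.04, 0.02, 0.04, 0.45]
covering = [0.20, 0.20, 0.20, 0.20, 0.20]  # spreads mass over both modes and the gap
seeking = [0.90, 0.04, 0.02, 0.02, 0.02]   # commits to the left mode

for name, student in (("covering", covering), ("seeking", seeking)):
    print(name,
          "forward KL", round(kl(teacher, student), 3),
          "reverse KL", round(kl(student, teacher), 3),
          "JSD", round(jsd(teacher, student), 3))
```

With these numbers, forward KL scores the covering student as closer to the teacher, while reverse KL prefers the seeking student. The JSD values stay within \([0, \log 2]\) for both.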

The Unified Mathematical View

The survey's key theoretical contribution is Equation 14, a generalized OPD objective:

\[\mathcal{L}_{\text{OPD}}(\theta) = \mathbb{E}_{y \sim \pi_{\text{mix}}}\left[\sum_{t=1}^{|y|} D_f(p_T(\cdot|x, y_{<t}), p_\theta(\cdot|x, y_{<t}))\right]\]

where \(\pi_{\text{mix}}\) is the behavioral policy driving state exploration and \(D_f\) is a divergence from the \(f\)-divergence family. The survey uses two-argument notation \(D_f(P, Q)\) rather than \(D_f(P \| Q)\) because different methods place teacher and student in different argument positions.

GKD (Agarwal et al., 2024) defines \(\pi_{\text{mix}}\) as interpolation between dataset and student: output sequences are drawn with probability \(\lambda\) from \(p_\theta\) and with probability \(1-\lambda\) from \(D\). GKD tests Forward KL, Reverse KL, and JSD, defaulting to JSD for its bounded gradients. Setting \(\lambda \to 1\) makes GKD purely on-policy.

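MiniLLM (Gu et al., 2024) selects Reverse KL for mode-seeking behavior: \(D_{KL}(p_\theta \| p_T)\). For sampling, it uses a teacher-mixed strategy \(\pi_{\text{mix}} = (1-\alpha)p_\theta + \alpha p_T\) with \(\alpha = 0.2\). Because Reverse KL takes its expectation over the student's own samples, the parameters \(\theta\) appear in both the sampling distribution and the log-ratio, so the gradient must account for how the samples themselves change with \(\theta\). MiniLLM handles this with a REINFORCE-style policy gradient: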

\[\nabla_\theta \mathcal{L}_{\text{MiniLLM}} = -\mathbb{E}_{y \sim p_\theta}\left[\sum_{t=1}^{|y|} (R_t - 1) \nabla_\theta \log p_\theta(y_t|y_{<t})\right]\]

where \(R_t = \sum_{t'=t}^{|y|} \log \frac{p_T(y_{t'}|y_{<t'})}{p_\theta(y_{t'}|y_{<t'})}\) is the cumulative reward-to-go, the sum of teacher-student log-ratios from position \(t\) to the end of the sequence.
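As a sketch of how the reward-to-go could be computed, the function below takes the per-token reward to be the log-ratio \(\log p_T - \log p_\theta\) and accumulates it from the end of the sequence. The function name and the example probabilities are hypothetical. A real implementation would work with log-probabilities from the two models rather than raw probabilities.

```python
import math

def reward_to_go(teacher_probs, student_probs):
    """R_t = sum over t' >= t of log(p_T(y_t') / p_theta(y_t')), for every position t.

    Each list holds the probability that the model assigned to the token
    actually sampled at that position.
    """
    log_ratios = [math.log(pt / ps) for pt, ps in zip(teacher_probs, student_probs)]
    rewards, running = [], 0.0
    for lr in reversed(log_ratios):  # accumulate from the last token backwards
        running += lr
        rewards.append(running)
    return rewards[::-1]

# Hypothetical per-token probabilities for a three-token student sample.
teacher_probs = [0.5, 0.25, 0.1]
student_probs = [0.25, 0.25, 0.2]
print(reward_to_go(teacher_probs, student_probs))
```

Positions where the teacher assigns more probability than the student contribute positive reward, and a token the teacher finds unlikely lowers the reward-to-go of every earlier position.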

DistiLLM (Ko et al., 2024) addresses REINFORCE instability through skewed KL divergences. It defines a skewed mixture \(\tilde{p} = \alpha p_T + (1-\alpha)p_\theta\) and minimizes \(D_{KL}(p_T \| \tilde{p})\). Because the expectation is over the fixed teacher \(p_T\), the loss reduces to tractable cross-entropy without policy gradients.
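The effect of the skew can be seen in a few lines. The sketch below implements \(D_{KL}(p_T \| \alpha p_T + (1-\alpha)p_\theta)\) for discrete distributions. The function name and the example values are illustrative, not taken from the DistiLLM code.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_kl(p_teacher, p_student, alpha):
    """KL(p_T || alpha * p_T + (1 - alpha) * p_theta)."""
    mix = [alpha * pt + (1.0 - alpha) * ps for pt, ps in zip(p_teacher, p_student)]
    return kl(p_teacher, mix)

teacher = [0.5, 0.5]
student = [1.0, 0.0]  # zero mass on a token the teacher supports
# Plain forward KL(teacher || student) is infinite here; the skewed version is finite.
print(skew_kl(teacher, student, alpha=0.1))
```

Mixing a little teacher mass into the second argument keeps the log-ratio bounded even where the student assigns zero probability, which is one way to read the stability claim.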

The unified perspective shows these are not disparate heuristics but "systematic, progressive exploration of a design space governed by three interacting choices: trajectory sampling (\(\pi_{\text{mix}}\)), divergence generator (\(f\)), and the argument ordering and target distribution."
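The three design choices can be sketched as parameters of a single function. The code below is a schematic under strong simplifying assumptions: `teacher` and `student` are hypothetical stand-ins that return a next-token distribution for a prefix, and the sampler follows the GKD-style mixture described above. It is not an implementation of any surveyed method.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for discrete distributions (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Argument ordering: each divergence receives (teacher_dist, student_dist).
def forward_kl(p_t, p_s):
    return kl(p_t, p_s)

def reverse_kl(p_t, p_s):
    return kl(p_s, p_t)

def prefixes(sequence):
    """All prefixes y_<t for t = 1..|y| (the first prefix is empty)."""
    return [tuple(sequence[:t]) for t in range(len(sequence))]

def opd_loss(sequence, teacher, student, divergence):
    """Token-level divergence summed over the prefixes of one sampled sequence."""
    return sum(divergence(teacher(s), student(s)) for s in prefixes(sequence))

def sample_sequence(dataset_sequence, student_rollout, lam, rng):
    """GKD-style pi_mix: a student rollout with probability lam, else the dataset sequence."""
    return student_rollout() if rng.random() < lam else dataset_sequence

# Hypothetical 3-token models whose next-token distribution ignores the prefix.
def teacher(prefix):
    return [0.7, 0.2, 0.1]

def student(prefix):
    return [0.5, 0.3, 0.2]

rng = random.Random(0)
seq = sample_sequence([0, 1, 0], lambda: [0, 0, 2, 1], lam=1.0, rng=rng)
print(seq)
print(opd_loss(seq, teacher, student, forward_kl))
print(opd_loss(seq, teacher, student, reverse_kl))
```

Swapping `forward_kl` for `reverse_kl`, or changing `lam`, moves between points in the design space without touching the loss loop itself.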

White-Box Token-Level Methods

The survey organizes token-level methods (Section 4.1) around three conceptual threads.

From Fixed to Adaptive Divergences. GKD established the canonical framework with mixture sampling \(\lambda \in [0, 1]\). DistiLLM addressed gradient instability through skewed KL with parameter \(\alpha\), combining Skewed Forward KL on teacher outputs with Skewed Reverse KL on student outputs.

ToDi (Jung et al., 2025) introduced per-token adaptive divergence: \(\mathcal{L}_{\text{ToDi}} = \mathbb{E}_{y \sim D}[\sum_t \omega_t D_f^{(t)}(p_T \| p_\theta)]\) where the adaptive weighting \(\omega_t\) derives from the teacher-student log-probability ratio. When the student underestimates the teacher's high-confidence predictions, Reverse KL dominates; when the teacher is uncertain, Forward KL preserves diversity.

Entropy-Aware OPD (Jin et al., 2026) gates divergence by teacher entropy: \(\alpha_t\) increases with \(H(p_T(\cdot|y_{<t}))\). In low-entropy regions, Reverse KL dominates for precise imitation; in high-entropy regions, Forward KL captures the full range of plausible outputs. On Qwen3-4B-Base, this achieves "+5.05 Pass@8 improvement over standard Reverse KL on mathematical reasoning."
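One way to picture entropy gating is a convex combination of the two KL directions whose weight follows the teacher's entropy. The sketch below assumes a specific gate, entropy normalized by \(\log |\mathcal{V}|\), purely for illustration. The gate actually used by Entropy-Aware OPD is not specified here and may differ.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(p || q) for discrete distributions (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy_gate(p_teacher):
    """Assumed gate: teacher entropy divided by its maximum log|V|, so alpha lies in [0, 1]."""
    return entropy(p_teacher) / math.log(len(p_teacher))

def gated_divergence(p_teacher, p_student):
    """alpha * forward KL + (1 - alpha) * reverse KL, with alpha set by teacher entropy."""
    alpha = entropy_gate(p_teacher)
    return alpha * kl(p_teacher, p_student) + (1.0 - alpha) * kl(p_student, p_teacher)

confident = [0.97, 0.01, 0.01, 0.01]  # low-entropy teacher: reverse KL dominates
uncertain = [0.25, 0.25, 0.25, 0.25]  # high-entropy teacher: forward KL dominates
student = [0.4, 0.3, 0.2, 0.1]
print(entropy_gate(confident), entropy_gate(uncertain))
print(gated_divergence(confident, student), gated_divergence(uncertain, student))
```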

G-OPD (Yang et al., 2026b) proves OPD is a special case of dense KL-constrained RL. When \(\alpha > 1\) (ExOPD), the student is incentivized to exceed the teacher's distribution, "breaking the imitation ceiling."

Token Weighting and Selection. Orthogonal to divergence choice is which tokens matter most. AdaKD (Xie et al., 2025) adjusts per-token weights based on student-teacher gap. SelecTKD (Huang et al., 2025) selects tokens based on learning value. Concrete Score Distillation (CSD) (Kim et al., 2026b) operates directly on logits rather than probabilities, "avoiding the information loss introduced by softmax normalization."

Cross-Architecture Distillation. DSKD (Zhang et al., 2025b) addresses vocabulary mismatch through dual projectors mapping hidden states between teacher and student representation spaces. Delta KD (Cao et al., 2025) preserves the distributional shift from pre-training to fine-tuning rather than matching absolute distributions.

Sequence-Level and Hybrid Methods

Sequence-level methods (Section 4.2) reframe distillation as trajectory-level RL. MiniLLM's REINFORCE derivation shows that minimizing sequence-level Reverse KL is equivalent to policy gradient RL with reward \(r(x, y) = \log p_T(y|x) - \log p_\theta(y|x)\). However, "REINFORCE estimation in massive combinatorial output spaces demands substantially more training iterations."

Zimmer et al. (2025) recast distillation as a constrained MDP, maximizing task reward subject to hard constraint \(D_{KL}(p_\theta \| p_T) \leq \epsilon\). KETCHUP (Fan et al., 2025) uses K-step returns via Bellman Optimality to reduce REINFORCE variance.

The survey identifies a fundamental tradeoff: "Token-level methods optimize an upper bound of the sequence-level divergence... enforced local alignment yields low variance and high stability, but greedy optimization can perfect local syntax while missing global coherence. Sequence-level methods directly model trajectory reward... but the high variance... demands substantially more compute."

Hybrid methods (Section 4.3) address practical bottlenecks:

Compute efficiency. Fast OPD (Zhang et al., 2026a) observes that "training signals are concentrated in the prefix of each output" and truncates both sampling horizon and loss computation, reducing training FLOPs by "2× to 47×" while matching full OPD performance. Speculative KD (Xu et al., 2025c) uses student generations as speculative drafts that the teacher verifies in parallel.

Capacity gap bridging. When teacher-student capacity ratio exceeds 10×, direct matching causes mode averaging. TAID (Shing et al., 2025) constructs time-dependent intermediate distribution \(P_t = (1-\lambda_t)P_\theta + \lambda_t P_T\) that gradually approaches the teacher. Veto (Jang et al., 2026) uses a geometric bridge in logit space with parameter \(\beta\) serving as both Adaptive Gradient Veto and Decisiveness Knob.
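The intermediate-target idea can be sketched directly from the interpolation formula. The linear schedule for \(\lambda_t\) below is an assumption made for illustration; TAID's actual schedule may differ.

```python
def intermediate_target(p_student, p_teacher, lam):
    """P_t = (1 - lam) * P_theta + lam * P_T, elementwise over the vocabulary."""
    return [(1.0 - lam) * ps + lam * pt for ps, pt in zip(p_student, p_teacher)]

def linear_schedule(step, total_steps):
    """Assumed schedule: lam rises linearly from 0 to 1 over training."""
    return min(1.0, step / total_steps)

p_student = [0.5, 0.3, 0.2]  # hypothetical distributions
p_teacher = [0.8, 0.15, 0.05]
for step in (0, 5, 10):
    lam = linear_schedule(step, total_steps=10)
    print(step, lam, intermediate_target(p_student, p_teacher, lam))
```

Early in training the target sits near the student's own distribution, so the gap to close at any single step stays small.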

Curriculum design. PACED (Xu et al., 2026) proves through gradient signal-to-noise ratio analysis that distillation gradient SNR vanishes when student pass rate approaches 0 or 1. It implements Beta kernel weighting \(w(p) = p^\alpha(1-p)^\beta\) that "assigns maximum weight to sequences near the student's competence frontier." AdaSwitch (Peng et al., 2025) introduces adaptive switching at the token level: "once divergence exceeds a context-aware threshold, teacher guidance is selectively integrated."
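The Beta kernel is simple enough to evaluate by hand. The sketch below uses the formula \(w(p) = p^\alpha(1-p)^\beta\) as given, with default exponents chosen arbitrarily for illustration.

```python
def beta_kernel_weight(pass_rate, alpha=1.0, beta=1.0):
    """w(p) = p^alpha * (1 - p)^beta, where p is the student's pass rate on a prompt."""
    return pass_rate ** alpha * (1.0 - pass_rate) ** beta

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, beta_kernel_weight(p))
```

The weight is zero at pass rates of 0 and 1 and, by elementary calculus, peaks at \(p = \alpha/(\alpha+\beta)\). Raising \(\alpha\) relative to \(\beta\) therefore shifts emphasis toward prompts the student already solves more often.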

Black-Box Methods

When teacher logits are inaccessible (Section 5.1), methods construct supervision through adversarial or preference-based signals.

GAD (Ye et al., 2025) casts distillation as a minimax game: a student generator \(G\) produces responses while a discriminator \(D\) distinguishes student from teacher outputs using a Bradley-Terry preference model. The generator maximizes \(V(G, D)\) via REINFORCE using discriminator scores as rewards.

Lion (Jiang et al., 2023) implements a three-stage adversarial loop: imitation, discrimination (teacher identifies student weaknesses), and generation (teacher creates harder instructions). Lion-13B achieves competitive performance on BIG-Bench Hard and AGIEval using "only 70k training examples."

OVD (Xiong et al., 2026) replaces logit matching with "trajectory matching using discrete verbal scores (0–9) from teacher models," eliminating vocabulary alignment requirements and delivering "up to +12.9% absolute EM improvement on web QA."

ORPO-Distill (Singh et al., 2025) formulates cross-architecture distillation as preference optimization, contrasting teacher-generated (preferred) and student-generated (dispreferred) traces via Odds-Ratio Preference Optimization.

Self-Distillation Methods

When no external teacher exists (Section 5.2), models bootstrap their own capabilities.

SPIN (Chen et al., 2024) formulates self-distillation as a two-player game: at iteration \(t\), the updated model \(p_{\theta_{t+1}}\) is trained to distinguish the previous iteration's generations from human-written responses. The loss:

\[\mathcal{L}_{\text{SPIN}} = \mathbb{E}_{x,y \sim p_{\text{data}}, y' \sim p_{\theta_t}}\left[\ell\left(\lambda\left(\log \frac{p_{\theta_{t+1}}(y|x)}{p_{\theta_t}(y|x)} - \log \frac{p_{\theta_{t+1}}(y'|x)}{p_{\theta_t}(y'|x)}\right)\right)\right]\]

SPIN provides a theoretical convergence guarantee: the global optimum is achieved when \(p_\theta = p_{\text{data}}\). Starting from Zephyr-7B-SFT, it improves MT-Bench from 5.94 to 6.78 over 3 iterations, with diminishing returns.
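For a single (human, self-generated) pair, the SPIN loss reduces to a scalar computation on four sequence log-probabilities. The sketch below assumes \(\ell\) is the logistic loss \(\ell(t) = \log(1 + e^{-t})\); the argument names and example values are hypothetical.

```python
import math

def logistic_loss(t):
    """l(t) = log(1 + exp(-t)), written to avoid overflow for large |t|."""
    if t > 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))

def spin_pair_loss(logp_new_real, logp_old_real, logp_new_synth, logp_old_synth, lam=1.0):
    """Loss for one pair: y is the human response, y' the previous model's generation.

    Each argument is a sequence log-probability log p(y|x) under the new
    (theta_{t+1}) or old (theta_t) model.
    """
    margin = (logp_new_real - logp_old_real) - (logp_new_synth - logp_old_synth)
    return logistic_loss(lam * margin)

# No change from the old model: margin 0, loss log 2.
print(spin_pair_loss(-10.0, -10.0, -12.0, -12.0))
# The new model raises the human response and lowers its own generation: loss drops.
print(spin_pair_loss(-8.0, -10.0, -14.0, -12.0))
```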

The saturation ceiling. The survey identifies a critical failure mode: "Because the student optimizes against a target sharing its own inductive biases and architectural limitations, the hypothesis space progressively collapses." Once the model becomes highly confident in flawed reasoning, gradients vanish and exploration ceases. Two strategies address this: (1) external verifiers (code execution, math verifiers) that shatter self-reinforcing loops, and (2) the distillation-RL loop with periodic reward model refreshing.

Privileged information self-distillation. OPSD (Zhao et al., 2026) lets a single model serve dual roles: teacher conditions on both problem \(x\) and ground-truth answer \(y^*\), while student conditions only on \(x\). The teacher provides dense token-level supervision on student-generated sequences. On AIME 2024/2025 and HMMT, "OPSD matches or exceeds Group Relative Policy Optimization (GRPO) at 4B and 8B scales while using an order of magnitude fewer generated tokens."

GATES (Stein et al., 2026) extends this to settings without ground-truth labels: a single model acts as tutor (conditioned on source document) and student (answering from question alone). A consensus-based gating mechanism suppresses gradients when tutor uncertainty is high.

Reasoning compression. OPSDC (Sang et al., 2026) applies on-policy self-distillation where the same model serves as both teacher (prompted to "be concise") and student, reducing chain-of-thought token count by "57–59% on MATH-500 while simultaneously improving accuracy by 9–16 percentage points."

Reasoning Distillation

Section 6 examines how OPD transfers chain-of-thought reasoning. The objective becomes:

\[\mathcal{L}_{\text{CoT-OPD}}(\theta) = \mathbb{E}_{x \sim D, r^S \sim P_\theta(\cdot|x)}\left[\sum_{t=1}^{|r^S|} D_{KL}(P_\theta(\cdot|x, r^S_{<t}) \| P_T(\cdot|x, r^S_{<t}))\right]\]

where \(r^S = (r^S_1, r^S_2, \ldots, r^S_{|r^S|})\) is the sequence of intermediate reasoning tokens sampled from the student. Reverse KL is mode-seeking, ensuring "the student reinforces reasoning paths that are highly probable under its own parameterization, provided they are validated by the teacher."

SuperCorrect (Yang et al., 2025b) introduces a two-stage framework: (1) extracting hierarchical thought templates from the teacher, and (2) cross-model collaborative DPO where the teacher provides correction traces for student errors. SuperCorrect-7B surpasses DeepSeekMath-7B by "7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K."

Reward-guided distillation. RLKD (Xu et al., 2025b) proposes a Generative Structure Reward Model (GSRM) that measures structural alignment between student and teacher reasoning:

\[\max_\theta J(\theta) = \mathbb{E}_{x \sim D, y \sim P_\theta(\cdot|x)}\left[R_T(x, y) - \beta D_{KL}(P_\theta(\cdot|x) \| P_T(\cdot|x))\right]\]

LUFFY (Yan et al., 2025) extends GRPO to a Mixed-Policy objective incorporating off-policy reasoning traces (e.g., from DeepSeek-R1) alongside on-policy rollouts, achieving "an average +6.4 point gain over standard RLVR methods on six math benchmarks."

DeepSeek-R1 off-policy analysis. The survey dedicates substantial attention to DeepSeek-R1, which used strictly off-policy distillation: 800,000 samples curated offline, student models fine-tuned via standard SFT with "no on-policy student rollouts, no logit matching, no iterative teacher-student interaction."

The survey identifies three factors explaining its effectiveness despite being off-policy: (1) "data quality dominates data distribution"—the 800K samples contain extremely high-quality reasoning with step-by-step verification and self-correction; (2) reasoning traces are inherently self-correcting, containing backtracking and multi-path exploration; (3) the student's task is memorization of structure, not exploration—mathematical reasoning has relatively few valid paths.

Empirical results: "R1-Distill-Qwen-1.5B scores 28.9% on AIME 2024, already surpassing GPT-4o (9.3%) and Claude-3.5-Sonnet (16.0%). R1-Distill-Qwen-7B reaches 55.5%, R1-Distill-Qwen-14B achieves 69.7%."

The survey notes the off-policy ceiling: at 32B scale, direct GRPO on Qwen2.5-32B-Base achieves only 47.0% on AIME 2024, while R1-Distill-Qwen-32B reaches 72.6%—"a gap of 25.6 percentage points."

4. Key Insights and Innovations

Innovation 1: The Unified f-Divergence Framework

The survey's most fundamental theoretical contribution is demonstrating that methods previously studied in isolation—GKD, MiniLLM, DistiLLM—instantiate the same underlying objective with different parameterizations. This is not merely taxonomic but analytical: it provides a common mathematical language showing that:

GKD uses mixture sampling \(\pi_{\text{mix}} = \lambda p_\theta + (1-\lambda)D\) with flexible divergence, defaulting to JSD.
MiniLLM uses teacher-mixed sampling \(\pi_{\text{mix}} = (1-\alpha)p_\theta + \alpha p_T\) with Reverse KL.
DistiLLM uses pure on-policy sampling \(p_\theta\) with skewed KL targets \(\tilde{p} = \alpha p_T + (1-\alpha)p_\theta\).

This unification reveals that the field has been systematically exploring a continuous design space governed by sampling policy, divergence generator, and argument ordering—a space that was previously implicit and unstructured.

Innovation 2: Adaptive Divergence Selection as a Core Principle

The survey establishes that adaptive divergence methods (ToDi, Entropy-Aware OPD) consistently outperform fixed divergences, and provides the theoretical explanation: "At reasoning-critical tokens, the teacher's distribution is typically unimodal... and Reverse KL aligns well. At generation-flexible tokens... the distribution is near-uniform over synonyms, and Forward KL preserves this diversity."

This insight transforms the field's question from "which divergence?" to "which divergence where?"—a qualitative shift in how distillation objectives should be designed. The geometric intuition from Figure 1 (mode-covering vs. mode-seeking behavior) combined with the entropy-based analysis provides both theoretical grounding and practical guidance.

Innovation 3: The Privileged Information Paradigm for Self-Distillation

The survey synthesizes OPSD, GATES, and related work into a coherent paradigm: using training-time information (ground-truth answers, source documents) that is unavailable at test time to generate higher-quality supervision. This "breaks the saturation ceiling" that constrains pure self-play by injecting fresh information into the loop.

The key insight is that privileged information reduces teacher entropy "sufficiently to provide meaningful supervision, but not so much that the teacher's distribution becomes degenerate." This paradigm is shown to be particularly effective "in hard, long-horizon RL settings where the base model achieves near-zero solve rates without PI."

Innovation 4: The Distillation-RL Unification

The survey documents a fundamental convergence: "the boundary between distillation and reinforcement learning is rapidly dissolving." Methods like G-OPD prove that OPD is a special case of KL-constrained RL; RLAD shows that selective imitation outperforms both offline distillation and pure RL; the unified framework of Li et al. (2025) shows KD and RL contribute complementary gradients—dense teacher supervision stabilizes early training while RL enables exploration beyond the teacher.

This unification has practical implications: the optimal training signal combines both. The survey's "distillation tax" budget rule recommends allocating "60–70% of the training budget to off-policy warm-up, 20–30% to on-policy logit distillation, and 10% to reward-guided refinement."

Innovation 5: Quantifying the Compute-Quality Tradeoff¶

Section 7.3 provides concrete cost analysis missing from prior work. The survey formalizes expected compute cost:

\[C_{\text{off}} \approx N \times (F_{\text{teacher}} + F_{\text{student}} + B_{\text{student}})\]

\[C_{\text{on}} \approx N \times (G_{\text{student}} + \lambda F_{\text{teacher}} + F_{\text{student}} + B_{\text{student}})\]

where \(G\) is autoregressive generation cost (scaling quadratically with sequence length) and \(\lambda \in (0, 1]\) is the teacher supervision refresh rate. Because \(G_{\text{student}} \gg F_{\text{student}}\), \(C_{\text{on}}\) is typically "several times" \(C_{\text{off}}\).

A concrete example: "Off-policy over 1B tokens... totaling ~300 GPU-hours. On-policy over 1B tokens... yielding ~1,200–1,500 GPU-hours, a 4–5× overhead." This quantification enables principled budget allocation rather than trial-and-error.
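The two cost formulas can be evaluated directly. The sketch below assumes per-sample costs in arbitrary units; the specific numbers are hypothetical and chosen only so that generation dominates, as the survey describes. They are not the GPU-hour figures quoted above.

```python
def off_policy_cost(n, f_teacher, f_student, b_student):
    """C_off ~ N * (F_teacher + F_student + B_student)."""
    return n * (f_teacher + f_student + b_student)

def on_policy_cost(n, g_student, f_teacher, f_student, b_student, lam=1.0):
    """C_on ~ N * (G_student + lam * F_teacher + F_student + B_student)."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    return n * (g_student + lam * f_teacher + f_student + b_student)

# Hypothetical per-sample costs (arbitrary units), for illustration only.
costs = dict(f_teacher=2.0, f_student=0.5, b_student=1.0)
n_samples = 1000

c_off = off_policy_cost(n_samples, **costs)
c_on = on_policy_cost(n_samples, g_student=12.0, lam=1.0, **costs)
c_on_sparse = on_policy_cost(n_samples, g_student=12.0, lam=0.5, **costs)

print(c_on / c_off)         # overhead with teacher feedback on every sample
print(c_on_sparse / c_off)  # overhead when teacher supervision is refreshed half as often
```

With these made-up inputs, lowering \(\lambda\) trims only the teacher term, so the generation cost \(G_{\text{student}}\) remains the dominant contributor to the overhead.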

5. Experimental Analysis¶

Evaluation Methodology¶

As a survey, this paper does not conduct new experiments but synthesizes results from the surveyed literature. The survey aggregates:

Datasets. Mathematical reasoning: GSM8K, MATH, MATH-500, AIME 2024, AIME 2025, HMMT. Code generation: HumanEval, LiveCode. General instruction following: MT-Bench, AlpacaEval, Dolly, S-NI. Summarization: XSum, CNN/DM. Translation: WMT, Europarl.

Metrics. Pass@1 and Pass@k for reasoning benchmarks; accuracy for MMLU and domain-specific benchmarks; ROUGE-L for summarization; win rates on MT-Bench and AlpacaEval.

Model families. Qwen2.5/Qwen3 (0.6B–32B), Llama-3.1/3.2 (1B–70B), Gemma-2 (2B–27B), GPT-2, OPT, DeepSeek-R1 distilled variants, Mistral, and proprietary models (GPT-4, Claude).

Main Quantitative Results¶

Exposure bias mitigation. The survey cites Gudibande et al. (2023) as demonstrating that "off-policy distilled students can degrade sharply on tasks requiring sustained multi-step generation," which supports the theoretical exposure bias analysis.

Adaptive divergence superiority. Entropy-Aware OPD achieves "+5.05 Pass@8 improvement over standard Reverse KL on mathematical reasoning" on Qwen3-4B-Base. ToDi and AKL consistently outperform fixed divergence baselines, validating the adaptive divergence principle.

Reasoning compression. OPSDC reduces chain-of-thought token count by "57–59% on MATH-500 while simultaneously improving accuracy by 9–16 percentage points." On AIME 2024, "the 14B variant gains +10 points over the uncompressed baseline with 41% compression."

DeepSeek-R1 off-policy distillation. The most striking results come from strictly off-policy R1 distillation:

- AIME 2024 (pass@1): R1-Distill-Qwen-1.5B: 28.9% > GPT-4o (9.3%), Claude-3.5-Sonnet (16.0%)
- R1-Distill-Qwen-7B: 55.5%; R1-Distill-Qwen-14B: 69.7%; R1-Distill-Qwen-32B: 72.6%
- MATH-500: R1-Distill-Qwen-7B: 92.8%; R1-Distill-Qwen-32B: 94.3%

Distillation vs. direct RL. At 32B scale, "applying GRPO directly to Qwen2.5-32B-Base... achieves only 47.0% on AIME 2024... In contrast, R1-Distill-Qwen-32B reaches 72.6%, a gap of 25.6 percentage points."

Self-distillation limits. SPIN improves MT-Bench from 5.94 to 6.78 over 3 iterations "with diminishing returns at each round." The survey notes that without external verifiers or privileged information, self-play "cannot improve beyond the quality ceiling of the SFT data."

Black-box methods. Lion-13B achieves competitive performance on BIG-Bench Hard and AGIEval using "only 70k training examples." OVD delivers "up to +12.9% absolute EM improvement on web QA and +25.7% on math benchmarks."

Industrial systems. Nemotron-Cascade 2 (30B MoE with 3B active) achieves "Gold Medal-level performance on IMO, IOI, and ICPC World Finals" with "20× fewer parameters" than comparable models.

Ablation Studies and Robustness Checks¶

The survey cites several key ablations:

PACED gradient SNR analysis. The PACED analysis proves that "the distillation gradient SNR vanishes at both extremes of student capability: when the student's pass rate on a problem approaches 0 (too hard) or 1 (too easy)." Its Beta kernel weighting is "provably minimax-robust, with worst-case efficiency loss of only \(O(\delta^2)\)."

Failure modes of token-level OPD. Fu et al. (2026) identify three failure modes: "(1) an imbalanced one-token signal that reduces distribution matching to a single sample, (2) unreliable teacher guidance on student-generated prefixes that deviate from the teacher's training distribution, and (3) distortions caused by tokenizer or special-token mismatch."

ReST\(^{EM}\) degradation. Attempting to optimize revision models with RL-style training caused performance to "degrade substantially with sequential revisions," likely because on-policy data collection "amplified spurious correlations."

AKL theoretical analysis. Wu et al. (2025) demonstrate that "the mode-seeking and mode-covering characterizations of Reverse and Forward KL do not strictly hold for discrete distributions in LLMs, and that both divergences converge to the same objective given sufficient training epochs."
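Returning to the PACED result among the ablations above, the idea of down-weighting prompts whose pass rate is near 0 or 1 can be sketched with a Beta-shaped kernel. The exact kernel and shape parameters PACED uses are not stated here, so the form \(p^{\alpha}(1-p)^{\beta}\), the default exponents, and the prompt names below are assumptions for illustration.

```python
def beta_kernel_weight(pass_rate, alpha=1.0, beta=1.0):
    """Unnormalized Beta-shaped weight p**alpha * (1 - p)**beta.

    alpha and beta are hypothetical shape parameters, not PACED's values.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return (pass_rate ** alpha) * ((1.0 - pass_rate) ** beta)

# Hypothetical per-prompt pass rates estimated from student rollouts.
pass_rates = {"too_hard": 0.0, "frontier": 0.5, "mostly_solved": 0.9, "too_easy": 1.0}

weights = {name: beta_kernel_weight(p) for name, p in pass_rates.items()}
total = sum(weights.values())
normalized = {name: w / total for name, w in weights.items()}
print(normalized)
```

Prompts at either extreme receive zero weight, which mirrors the vanishing-SNR argument: little gradient signal is available there, so compute shifts to the competence frontier.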

Assessment: Do the Experiments Support the Claims?¶

Claim 1: On-policy training eliminates exposure bias and reduces compounding error from \(O(\epsilon T^2)\) to \(O(\epsilon T)\). Strongly supported theoretically through the DAgger derivation. Empirically supported through aggregated results showing consistent gains for on-policy methods over off-policy baselines on reasoning and multi-step generation tasks.

Claim 2: Adaptive divergence methods outperform fixed divergences. Strongly supported. The survey cites consistent improvements from ToDi, Entropy-Aware OPD, and related methods across multiple benchmarks, with specific numbers (+5.05 Pass@8 improvement for Entropy-Aware OPD).

Claim 3: Off-policy R1 distillation works due to data quality, but on-policy methods can push further. Supported with nuance. The survey provides compelling evidence for why off-policy works in R1's specific case (data quality, self-correcting traces, structured reasoning) and cites subsequent work (OPSD, PACED, RLKD) showing on-policy methods applied post-distillation yield further gains.

Claim 4: The unified framework provides analytical power. Supported through the systematic mapping of methods onto Equation 14 and the geometric analysis of divergence choices.

Potential Weaknesses¶

No new experiments. As a survey, the paper synthesizes existing results rather than conducting controlled comparisons. Some claims (e.g., the compute-quality tradeoff) rely on theoretical analysis rather than empirical validation.

Indirect comparisons. Methods are evaluated across different papers with different experimental setups, making direct comparison difficult. The survey mitigates this by citing specific benchmark results, but standardized evaluation protocols are lacking.

Difficulty estimation cost unaccounted. Methods like PACED that estimate student competence boundaries require additional computation that isn't always included in cost comparisons. The survey acknowledges this as an open problem.

6. Limitations and Trade-offs¶

Assumption: The Teacher Provides Reliable Supervision on Student-Generated States¶

The entire OPD framework rests on a critical assumption inherited from imitation learning theory: when the student queries the teacher on self-generated states, the teacher provides meaningful, calibrated supervision. The survey explicitly flags this assumption's fragility in Section 2.3:

"If the student hallucinates a severely out-of-distribution prefix, the teacher's conditional distribution may itself become poorly calibrated. Forcing the student to match this noisy distribution violates the core assumption of interactive imitation learning, potentially destabilizing training rather than recovering the \(O(\epsilon T)\) bound."

This is not merely theoretical. When a student generates a prefix containing factual errors or logical inconsistencies, the teacher—conditioned on this erroneous context—may assign high probability to hallucinated continuations that "make sense" given the flawed premise. The student, dutifully minimizing divergence against this distribution, learns to confidently produce nonsense. This "echo chamber" effect, mentioned in Sections 5.3 and 8, represents a fundamental failure mode that naive OPD does not address.

The survey cites several partial mitigations—Entropy-Aware OPD gates divergence choice based on teacher uncertainty; GATES uses consensus-based gating to suppress gradients when the tutor disagrees with itself; PACED focuses training on the competence frontier where supervision is most reliable. However, none of these methods explicitly model the difference between epistemic uncertainty (the teacher genuinely doesn't know, and the signal should be attenuated) and aleatoric uncertainty (genuine task ambiguity that should be preserved). The survey identifies this as an "important open challenge" (Section 8).

Compute Overhead: The 4–5× Training Cost Multiplier¶

The most severe practical limitation is computational. Section 7.3 provides concrete cost analysis that practitioners cannot ignore:

"Off-policy over 1B tokens, the teacher generates the dataset offline (~200 GPU-hours), the student trains for ~100 GPU-hours, totaling ~300 GPU-hours. On-policy over 1B tokens... yielding ~1,200–1,500 GPU-hours, a 4–5× overhead."

This overhead stems from the autoregressive generation cost \(G_{\text{student}}\), which scales quadratically with sequence length because each new token attends over the entire growing KV cache, combined with the need for teacher forward passes on dynamically generated student tokens. The survey notes that "Because \(G_{\text{student}} \gg F_{\text{student}}\), \(C_{\text{on}}\) is typically several times \(C_{\text{off}}\)."

The implications are stark: at a fixed GPU budget, the 4–5× overhead means training on roughly 4–5× fewer tokens, and matching the off-policy token count requires 4–5× more compute. The survey recommends a staged approach ("60–70% of the training budget to off-policy warm-up, 20–30% to on-policy logit distillation, and 10% to reward-guided refinement") but acknowledges this is a heuristic rather than a principled allocation. The absence of distillation scaling laws (Section 8) means practitioners currently rely on "costly trial-and-error to determine how much on-policy generation budget to allocate relative to student and teacher size."

GPU Memory Constraints for White-Box OPD¶

Beyond FLOPs, Section 7.3 identifies a memory bottleneck that may be more constraining for many teams:

"Holding a 70B teacher alongside a 7B student requires the teacher's weights (~140 GB in BF16), the student's weights and optimizer states (~84 GB), and, critically, the KV cache for generation and the full-vocabulary logits tensor ([B, T, |V|]) for distillation. Peak memory can easily exceed an 8×80 GB H100 node."

The survey cites practical mitigations—teacher quantization (FP8 or INT4 for 2–4× memory reduction), logit offloading to CPU or retaining only top-k, aggressive gradient checkpointing—but notes that "combining these techniques is essential for viable white-box OPD on standard clusters." This creates a barrier to entry: teams without access to specialized infrastructure (e.g., H100 nodes) may find white-box OPD simply infeasible regardless of its theoretical benefits.
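The static terms in the quoted memory estimate follow from simple arithmetic, sketched below. The 12 bytes per student parameter is an assumed breakdown that happens to reproduce the ~84 GB figure; the batch size, sequence length, vocabulary size, and top-k value are hypothetical. KV cache and activation memory are not modeled.

```python
GB = 1e9  # decimal gigabytes, matching the "~140 GB" figure

def weights_gb(n_params, bytes_per_param):
    """Memory for parameters (and any per-parameter state) in GB."""
    return n_params * bytes_per_param / GB

def logits_gb(batch, seq_len, vocab, bytes_per_value=2):
    """Size of one [B, T, |V|] logits tensor in GB (BF16 by default)."""
    return batch * seq_len * vocab * bytes_per_value / GB

teacher = weights_gb(70e9, 2)   # 70B teacher, BF16 weights only
student = weights_gb(7e9, 12)   # assumed 12 bytes/param for weights plus optimizer states

# Hypothetical shapes, for illustration only.
full_logits = logits_gb(batch=8, seq_len=4096, vocab=128_000)
top_k_logits = logits_gb(batch=8, seq_len=4096, vocab=64)  # values only, indices not counted

print(teacher, student, full_logits, top_k_logits)
```

Under these assumptions, keeping only the top-k entries shrinks the logits tensor by the ratio \(|V|/k\), which is why the survey lists it alongside teacher quantization as a mitigation.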

The Self-Play Saturation Ceiling¶

Section 5.3 provides an unvarnished analysis of self-distillation's fundamental limitation:

"Because the student optimizes against a target sharing its own inductive biases and architectural limitations, the hypothesis space progressively collapses. If the model discovers a syntactic 'hack' or a highly confident but flawed reasoning heuristic, no external signal penalizes it. The model self-reinforces this flawed trajectory, driving \(p_\theta(y_{\text{flawed}}|x) \to 1\). Once entirely certain of its own hallucinations, gradients vanish, exploration ceases, and the policy is trapped."

This is distinct from exposure bias—it arises not from train-test distribution mismatch but from the absence of distributional diversity. SPIN's empirical results validate this: MT-Bench improves from 5.94 to 6.78 over 3 iterations "with diminishing returns at each round." The survey is explicit that without external verifiers or privileged information, self-play "cannot improve beyond the quality ceiling of the SFT data."

Two strategies address this—external verifiers (code execution, math verifiers) and the distillation-RL loop—but both require additional infrastructure and compute that may not be available in all deployment scenarios.

Vocabulary Mismatch and Cross-Architecture Distillation¶

Section 4.1.3 highlights an assumption increasingly violated in practice: "Conventional white-box KD assumes that teacher and student share the same vocabulary and representation space, an assumption increasingly violated in practice (e.g., distilling Llama into Qwen)."

DSKD addresses this through dual projectors mapping hidden states between representation spaces, but "approximately doubles memory overhead." Delta KD preserves distributional shifts rather than absolute outputs, but "requires both pre-trained and fine-tuned model checkpoints." Cross-tokenizer KD uses optimal transport but operates at the logit level where "the vocabulary bottleneck has already compressed the teacher's internal representations."

The survey identifies latent space distillation as a promising direction, but notes that "extending latent distillation to on-policy settings... introduces additional challenges around representation drift during training" (Section 8). This remains an open problem with significant practical implications as teams increasingly need to distill across model families.

Agentic and Long-Horizon Settings Are Largely Unaddressed¶

Section 8 explicitly flags a critical gap:

"Current OPD methods overwhelmingly target single-turn generation or linear chains of thought, yet LLMs are increasingly deployed as autonomous agents interacting with external environments such as APIs, terminals, databases, and web browsers."

In agentic settings, three fundamental challenges arise that existing OPD methods do not address:

Environment non-stationarity: "Unlike text generation where the 'environment' (the prompt) is fixed, agentic environments change in response to actions, making direct trajectory comparison between teacher and student meaningless unless the teacher provides counterfactual evaluations conditioned on the student's actual state."
Tool-use combinatorics: "Modern agent frameworks expose dozens of tools with structured argument schemas, suggesting that distillation may need to operate at the tool-call level rather than individual tokens."
Safety-critical credit assignment: "In coding agents, a single incorrect file write can corrupt a repository. In web-browsing agents, a misguided click can trigger irreversible actions... Future distillation frameworks must incorporate safety constraints that prevent the student from exploring catastrophically dangerous action sequences during on-policy generation."

SCoRe provides an early realization (correcting only the earliest error in multi-step trajectories), and OEL extends context distillation to interactive environments, but the survey acknowledges these are "initial results" and that "developing OPD methods that balance exploration with safety in agentic settings remains a largely uncharted territory."

Uncertainty-Aware Distillation Is Underexplored¶

Section 8 identifies a critical gap in current methods:

"White-box OPD assumes that the teacher's logit distribution provides a reliable, high-quality supervisory signal at every token position. In practice, this assumption frequently breaks down. Teacher models suffer from overconfidence, hallucination, and poor calibration, especially when evaluated on out-of-distribution prefixes generated by an exploring student."

While Entropy-Aware OPD, GATES, and PACED partially address this through gating mechanisms, "none of these methods explicitly model the teacher's epistemic versus aleatoric uncertainty." The survey argues that epistemic uncertainty (teacher ignorance) should attenuate the distillation loss, while aleatoric uncertainty (genuine ambiguity) should be preserved in the student's output distribution. Achieving this decomposition efficiently "without requiring expensive Bayesian inference over the teacher's parameters" remains an open challenge.
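One standard way to approximate this split is the entropy decomposition used with ensembles: total predictive entropy minus the average member entropy gives a disagreement term that serves as the epistemic estimate. The sketch below assumes several teacher variants are available (for example dropout samples or checkpoints), which is an assumption beyond what this section states. The exponential attenuation and its `sharpness` parameter are hypothetical, and this is not a method proposed by the survey.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def decompose(ensemble):
    """Split predictive uncertainty at one token position.

    ensemble: list of next-token distributions, one per teacher variant.
    Returns (aleatoric, epistemic) in nats.
    """
    m = len(ensemble)
    vocab = len(ensemble[0])
    mean = [sum(p[i] for p in ensemble) / m for i in range(vocab)]
    total = entropy(mean)                             # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in ensemble) / m  # average per-member entropy
    epistemic = max(total - aleatoric, 0.0)           # disagreement between members
    return aleatoric, epistemic

def loss_weight(epistemic, sharpness=5.0):
    """Hypothetical attenuation: shrink the distillation loss as disagreement grows."""
    return math.exp(-sharpness * epistemic)

# Members agree that two tokens are equally good: ambiguity worth preserving.
agree_ambiguous = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
# Members are individually confident but contradict each other: teacher ignorance.
disagree = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]

alea_a, epi_a = decompose(agree_ambiguous)
alea_d, epi_d = decompose(disagree)
print(alea_a, epi_a, loss_weight(epi_a))
print(alea_d, epi_d, loss_weight(epi_d))
```

The first case keeps full weight despite high entropy, while the second is attenuated. The cost of running several teacher variants is exactly the kind of expense the survey flags as an obstacle.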

Evaluation Beyond Benchmarks: The Contamination Problem¶

The survey briefly but importantly notes (Section 8) that "Standard benchmarks (e.g., MMLU, GSM8K) suffer from data contamination and measure only static pattern matching, failing to capture the intrinsic robustness, out-of-distribution generalization, and hallucination rates of a distilled student relative to its teacher."

Fu et al. (2026) demonstrate that "token-level OPD can appear successful on standard benchmarks while failing catastrophically on distribution-shifted prompts." This raises a fundamental concern: the reported benchmark improvements may overstate true capability transfer. The survey calls for "dynamic, adversarial testing that probes whether the student has learned the underlying causal mechanism of reasoning rather than superficial correlations," and notes that standardized evaluation suites of this kind "would provide more reliable guidance for method selection and training decisions."

The Distillation Scaling Law Gap¶

Perhaps the most consequential limitation for practitioners is flagged in Section 8:

"Chinchilla scaling laws provide principled compute allocation for pre-training, and Busbridge et al. (2025) recently extended this framework to off-policy knowledge distillation... However, no equivalent scaling law exists for on-policy distillation, where the generation cost of student rollouts introduces an additional compute axis that fundamentally changes the optimization landscape."

The survey proposes a conjectured functional form:

\[L(N_S, N_T, D_{\text{on}}) = E + \frac{A}{N_S^\alpha} + \frac{B}{N_T^\beta} + \frac{C}{D_{\text{on}}^\gamma} + f(N_S, N_T)\]

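where \(N_S\) and \(N_T\) are the student and teacher sizes, \(D_{\text{on}}\) is the on-policy data volume, \(E\) is irreducible task entropy, and \(f(N_S, N_T)\) models capacity-gap interference. The survey explicitly acknowledges, however, that "no study has yet validated this functional form."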

The practical consequence: practitioners cannot answer basic allocation questions like "Given a fixed GPU budget, should one distill from a 70B teacher for 1B tokens or a 405B teacher for 200M tokens?" without costly trial-and-error. This transforms what could be a principled engineering decision into an empirical gamble.
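If such a law were ever fitted, the allocation question above would reduce to evaluating the formula at candidate configurations. The sketch below shows only the mechanics. Every coefficient is made up, the 7B student size is an assumption (the question does not specify one), and `interference` is a placeholder for the unspecified \(f(N_S, N_T)\), so the printed numbers carry no empirical meaning.

```python
def conjectured_loss(n_s, n_t, d_on, *, E, A, B, C, alpha, beta, gamma,
                     interference=lambda n_s, n_t: 0.0):
    """Evaluate the conjectured form L(N_S, N_T, D_on).

    interference stands in for f(N_S, N_T), which the text leaves unspecified;
    the zero default is a placeholder, not a claim about its shape.
    """
    return (E + A / n_s ** alpha + B / n_t ** beta
            + C / d_on ** gamma + interference(n_s, n_t))

# Every coefficient below is invented for illustration; none is fitted to data.
coeffs = dict(E=1.0, A=50.0, B=20.0, C=30.0, alpha=0.3, beta=0.3, gamma=0.3)

student_size = 7e9  # hypothetical student size
option_a = conjectured_loss(student_size, 70e9, 1e9, **coeffs)     # 70B teacher, 1B tokens
option_b = conjectured_loss(student_size, 405e9, 200e6, **coeffs)  # 405B teacher, 200M tokens
print(option_a, option_b)
```

Which option wins depends entirely on the exponents and on \(f(N_S, N_T)\), which is precisely the information that is currently missing.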

7. Implications and Future Directions¶

How This Survey Changes the Landscape¶

Establishing OPD as a Distinct Paradigm. Prior to this survey, existing literature "predominantly organize[d] the field around the classical compression framing, treating off-policy and on-policy approaches as interchangeable variants rather than as fundamentally distinct paradigms with different theoretical guarantees and failure modes" (Section 1). This survey establishes OPD not as a minor variant but as a paradigm shift—from static dataset learning to dynamic, interactive training with its own theoretical foundations (imitation learning), failure modes (echo chambers, saturation), and design principles (adaptive divergences, privileged information).

Providing a Common Mathematical Language. The unified \(f\)-divergence framework (Equation 14) does more than taxonomize—it enables direct comparison of methods that were previously studied in isolation. Researchers can now ask "how does GKD's mixture sampling compare to DistiLLM's skewed targets under the same divergence?" rather than treating them as incomparable heuristics. This analytical clarity accelerates research by revealing which dimensions of the design space are underexplored.

Bridging Distillation and Reinforcement Learning. The survey documents a fundamental convergence: G-OPD proves OPD is a special case of KL-constrained RL; RLAD shows selective imitation outperforms both pure distillation and pure RL; the hybrid framework of Li et al. (2025) shows KD and RL contribute complementary gradients. This unification dissolves a false dichotomy: practitioners need not choose between "distill then RL" and "RL then distill." The optimal training signal combines dense teacher supervision (stability) with sparse outcome rewards (exploration beyond the teacher).

Reconciling Contradictory Prior Findings. The survey explains why prior work reached opposite conclusions about self-correction and test-time computation effectiveness. Huang et al. (2023) found LLMs cannot self-correct reasoning; other work found self-refinement helps. The survey's exposure bias analysis reveals the missing variable: whether training exposes the student to its own error states. This transforms a confusing set of contradictory results into a coherent picture with clear boundary conditions.

Follow-Up Research This Work Enables¶

1. Distillation Scaling Laws for On-Policy Settings. The most pressing gap is the absence of scaling laws. The survey proposes a conjectured functional form but notes it remains unvalidated. A systematic empirical study—conducting controlled grid searches that independently vary student size \(N_S\), teacher size \(N_T\), and on-policy data volume \(D_{\text{on}}\) across multiple benchmarks—would yield enormous practical value. The DeepSeek-R1 results hint at the relationships: AIME 2024 performance scales as 28.9% → 55.5% → 69.7% → 72.6% for student sizes 1.5B → 7B → 14B → 32B, with "the steepest gain between 1.5B and 7B (26.6% absolute), while the 14B→32B jump yields only 2.9%" (Section 8). This suggests diminishing returns with student scale that scaling laws could formalize.

2. Uncertainty-Aware Distillation Objectives. The survey identifies that current methods do not decompose teacher uncertainty into epistemic and aleatoric components. A promising direction is to train an ensemble of teacher variants (via dropout, checkpoints, or explicit ensembling) and use the variance across ensemble members as an epistemic uncertainty estimate. When variance is high, attenuate the distillation loss; when variance is low but entropy is high, preserve the full distribution (genuine ambiguity). This would directly address the echo chamber problem where students confidently mimic uncertain teacher outputs.

3. Dynamic Curriculum Distillation. PACED demonstrates that Beta kernel weighting focused on the competence frontier improves sample efficiency. A natural extension is fully dynamic curriculum scheduling: continuously estimate each prompt's difficulty (via teacher-student divergence or pass rate), track the student's evolving competence boundary, and allocate compute preferentially to prompts in the "zone of proximal development." The interaction between curriculum scheduling and divergence adaptation (Forward KL early for coverage, Reverse KL later for consolidation) further expands the design space. The survey notes that "scaling this principle to large prompt pools—automatically estimating prompt difficulty, tracking the student's evolving competence boundary, and efficiently allocating compute across difficulty tiers—is ripe for exploration."

4. Latent Space Distillation for Cross-Architecture Transfer. DSKD's dual-space projectors and cross-tokenizer KD's optimal transport address vocabulary mismatch at the output level, but "a more ambitious direction is to bypass the vocabulary layer entirely and distill in latent space, matching the geometric structure of the teacher's hidden-state manifold rather than its token-level predictions" (Section 8). This would enable seamless cross-architecture OPD where teacher and student differ in hidden dimension, number of layers, and attention structure. The key challenge is extending latent alignment to on-policy settings where "the student's rollouts produce hidden states that must be aligned with the teacher's counterfactual representations."

5. Agentic Distillation. The survey explicitly flags this as "largely uncharted territory" (Section 8). Three research directions are suggested:

Environment-aware credit assignment: Instead of matching token-level distributions, develop objectives that match outcome distributions under environment dynamics. When the student takes action \(a\) in state \(s\), the teacher should evaluate not just \(p_T(a|s)\) but the expected value of the resulting trajectory.
Tool-call level distillation: For agents with tool access, operate at the level of structured tool invocations rather than individual tokens. This would capture the semantic content of tool calls while allowing syntactic flexibility in how they're expressed.
Safety-constrained exploration: Incorporate constraints that prevent catastrophic actions during training. This could take the form of learned safety critics that veto dangerous rollouts, or formal verification methods that bound worst-case harm.

6. Multimodal On-Policy Distillation. VOLD, Video-OPD, and X-OPD provide initial results for vision-language models, video grounding, and speech, but the survey notes "several questions remain":

"How should the distillation objective account for information asymmetry between modalities (e.g., when the teacher processes text but the student processes images of the same content)? Can on-policy distillation transfer spatial reasoning, temporal grounding, and audio understanding simultaneously, or does each modality require specialized objectives?"

The interaction between modality-specific data scarcity and on-policy generation is particularly promising, as OPD's self-correcting dynamics could partially compensate for limited annotated multimodal reasoning data.

7. Closing the Distillation-RL Loop. The survey argues that "the true performance ceiling likely lies in closing the loop, alternating between KD-driven stabilization and RL-driven exploration" (Section 8). Specific research questions include:

When should the optimization switch from distillation to RL? Possible triggers include distillation gradient magnitude falling below a threshold, or the student's performance plateauing on held-out data.
How can RL-discovered capabilities be preserved during subsequent distillation phases without being overwritten? Methods like experience replay from RL trajectories, or KL penalties against moving too far from the RL-optimal policy, could help.
What are the convergence properties of alternating optimization? Proving contraction bounds for such hybrid systems would ground empirical practice in theory.
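The two switching triggers mentioned in the first question above (a small distillation gradient, or a plateau on held-out data) can be combined in a few lines. This is a minimal sketch under assumed thresholds; `grad_threshold`, `patience`, and `min_delta` are hypothetical hyperparameters, and the survey does not prescribe this rule.

```python
def should_switch_to_rl(grad_norms, val_scores, grad_threshold=0.05,
                        patience=3, min_delta=0.001):
    """Return True when either hypothetical trigger fires.

    grad_norms: recent distillation gradient norms, oldest first.
    val_scores: held-out scores (higher is better), oldest first.
    """
    # Trigger 1: the latest distillation gradient norm is below the threshold.
    grad_small = bool(grad_norms) and grad_norms[-1] < grad_threshold

    # Trigger 2: the best of the last `patience` evaluations does not beat
    # the best earlier evaluation by at least `min_delta`.
    plateau = False
    if len(val_scores) > patience:
        best_before = max(val_scores[:-patience])
        best_recent = max(val_scores[-patience:])
        plateau = best_recent - best_before < min_delta

    return grad_small or plateau

still_learning = should_switch_to_rl([1.0, 0.5, 0.2], [0.10, 0.20, 0.30, 0.40, 0.50])
vanished_grad = should_switch_to_rl([1.0, 0.01], [0.10])
plateaued = should_switch_to_rl([1.0], [0.40, 0.50, 0.50, 0.50, 0.50])
print(still_learning, vanished_grad, plateaued)
```

A rule like this addresses only when to switch. Preserving RL-discovered behavior and analyzing convergence, the other two questions, are left untouched.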

Practical Applications and Downstream Use Cases¶

Instruction-Following and Chat. For general instruction-following, the survey recommends starting with off-policy SFT on teacher-generated data as "the most cost-effective starting point." If the student exhibits clear exposure bias—"performing well on teacher-style prompts but failing on user-generated ones with different phrasing"—the transition to on-policy distillation is justified (Section 8). DistiLLM and GKD provide the best quality-compute tradeoff in the white-box regime.

Mathematical Reasoning. The DeepSeek-R1 results demonstrate that off-policy distillation from a sufficiently capable teacher can transfer reasoning capabilities effectively: R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing GPT-4o (9.3%) and Claude-3.5-Sonnet (16.0%). However, to push beyond the off-policy ceiling, the survey recommends hybrid approaches: off-policy warm-up followed by on-policy refinement using OPSD (which "matches or exceeds GRPO at 4B and 8B scales while using an order of magnitude fewer generated tokens") or reward-guided methods like RLKD.

Reasoning Compression. For deployment scenarios where inference cost matters, OPSDC demonstrates that verbosity in distilled reasoning models is a "trainable inefficiency rather than a necessary cost"—reducing chain-of-thought length by 57–59% while improving accuracy by 9–16 percentage points on MATH-500. This has direct implications for latency-sensitive applications.

Cross-Architecture Distillation. When teacher and student use different tokenizers or architectures, DSKD provides a viable path despite its ~2× memory overhead. For simpler vocabulary mismatches, cross-tokenizer KD's optimal transport approach may suffice. The survey notes that these methods are "orthogonal to the sampling strategy and can be integrated into on-policy rollouts."

Proprietary Model Distillation (Black-Box). When only API access to the teacher is available, GAD enables effective on-policy learning through adversarial formulations, and OVD demonstrates that "coarse verbal feedback can outperform dense logit matching when combined with on-policy exploration"—delivering up to +12.9% absolute EM improvement on web QA. Lion provides a curriculum-driven approach requiring "only 70k training examples" for competitive performance.

Self-Improvement Without External Teachers. When no superior teacher exists, OPSD and GATES provide paradigms for using privileged information (ground-truth answers, source documents) to bootstrap supervision. SDPO demonstrates that rich textual feedback (compiler errors, test outputs, proof checker messages) can substitute for scalar rewards, enabling self-distillation in domains with executable verification but no reward model.

When to Prefer OPD Over Alternatives¶

Prefer on-policy distillation when:

- The task involves multi-step reasoning where compounding errors degrade quality (Section 8 explicitly states: "the transition to on-policy distillation is justified when... the task involves multi-step reasoning").
- The student exhibits exposure bias (performing well on teacher-style prompts but failing on user-generated ones).
- The student needs to exceed the teacher in a specific domain via reward-guided exploration (G-OPD's ExOPD).
- High-quality reasoning traces are available but static datasets cannot cover the combinatorial space of valid solution paths.

Prefer off-policy distillation when:

- The task is general instruction-following where exposure bias is less severe.
- Compute budget is limited and the 4–5× overhead of on-policy training is prohibitive.
- High-quality teacher data is already available (as in DeepSeek-R1's 800K curated samples).
- The student's task is memorization of structure rather than exploration of alternatives.

For the hybrid approach (generally recommended): The survey's practical guideline is explicit: "60–70% of the training budget to off-policy warm-up, 20–30% to on-policy logit distillation, and 10% to reward-guided refinement." This staged approach "front-loads cheap, high-bandwidth learning and reserves expensive on-policy compute for the final quality push."
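As a worked example of the staged guideline, the sketch below splits a GPU-hour budget across the three stages. The default shares (65%, 25%, 10%) are midpoints picked here so that the split sums to 100%; the survey gives ranges rather than these exact values, and the stage names are illustrative.

```python
def split_budget(total_gpu_hours, warmup=0.65, on_policy=0.25, reward=0.10):
    """Split a compute budget across the three stages of the hybrid recipe.

    The default shares are assumed midpoints of the quoted ranges.
    """
    shares = {
        "off_policy_warmup": warmup,
        "on_policy_logit_distillation": on_policy,
        "reward_guided_refinement": reward,
    }
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {stage: total_gpu_hours * share for stage, share in shares.items()}

plan = split_budget(1000.0)  # hypothetical 1,000 GPU-hour budget
for stage, hours in plan.items():
    print(f"{stage}: {hours:.0f} GPU-hours")
```

Because on-policy tokens cost several times more than off-policy ones, a 25% share of compute buys a much smaller share of training tokens, which is consistent with reserving that stage for the final quality push.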

Looking Ahead¶

The survey concludes with three "most pressing open problems" that frame the field's future:

Distillation scaling laws to transform compute allocation from trial-and-error to principled engineering.
Uncertainty-aware objectives that prevent echo chamber effects when students explore out-of-distribution regions.
Agentic distillation where training involves entire interaction trajectories rather than single completions.

The survey anticipates "convergence toward unified KD+RL frameworks that treat distillation not as a separate training stage but as a continuous regularization mechanism throughout the model's lifecycle, from pre-training through deployment." This vision—distillation as continuous capability refinement rather than one-shot transfer—represents a qualitative shift in how LLM post-training will be conceptualized and implemented.