Black-Box On-Policy Distillation of Large Language Models

ArXiv: 2511.10643

🎯 Pitch

The paper introduces Generative Adversarial Distillation (GAD), a black-box, on-policy distillation method that trains a student LLM as a generator against a learned discriminator which simultaneously serves as an adaptive reward model for reinforcement learning. By enabling on-policy learning from teacher-generated text alone (no logits or parameters), GAD reliably extracts deeper behavioral knowledge from proprietary teachers, significantly outperforming sequence-level distillation and improving out-of-distribution generalization—making high-quality LLM compression feasible when only API access is available.


1. Executive Summary

This paper introduces Generative Adversarial Distillation (GAD), a method for black-box (text-only) distillation that enables on-policy training of a student LLM from a proprietary teacher without access to teacher logits or parameters (Section 1, Section 2; Figure 2; Eq. (1)–(3)). The core idea is to train a student as a generator against a learned discriminator that distinguishes teacher vs. student outputs, where the discriminator simultaneously serves as an adaptive on-policy reward model for reinforcement learning updates (Section 2.1–2.3). Across multiple student model families and several test sets, GAD outperforms standard sequence-level distillation (SeqKD) and shows stronger out-of-distribution gains, with the paper highlighting that Qwen2.5-14B-Instruct with GAD becomes comparable to the GPT-5-Chat teacher on their automatic evaluation (Figure 1; Table 2).


2. Context and Motivation

  • Problem / gap addressed
  • Distilling a smaller student LLM from a large teacher is useful for efficiency, but many standard distillation objectives require white-box access to the teacher’s probability distribution (logits) or internal states (Section 1).
  • In practice, proprietary teacher LLMs are often only accessible via an API returning generated text, which defines the black-box distillation setting (Section 1, Section 2).
  • Black-box distillation makes likelihood-based alignment objectives (e.g., KL divergence between distributions) unavailable; it also becomes harder when teacher and student tokenizers differ, complicating token-level objectives (Section 1; also revisited in Appendix B.1 “Qwen2.5 Teacher” scenario and Table 7).

  • Why the problem matters

  • Without a probability-level signal, typical black-box approaches collapse to supervised fine-tuning (SFT) on teacher responses, which may not extract “deeper” behavioral knowledge and may generalize poorly (Section 1; Section 3.2 discussion of OOD generalization).
  • Recent white-box results emphasize on-policy learning (student learns from its own generations) to reduce exposure bias and encourage mode-seeking behavior (Section 1). But black-box settings lack a way to evaluate student-generated outputs against the teacher at the probability level, blocking straightforward on-policy distillation (Section 1).

  • Prior approaches and shortcomings (as positioned here)

  • White-box distillation: forward/reverse KL, hidden-state matching, attention matching, etc. (Related Work, Section 4). Not applicable to API-only teachers.
  • Black-box standard baseline: SeqKD (sequence-level knowledge distillation), essentially SFT on teacher outputs (Section 1; Section 4). The paper argues (and later provides evidence) that this can overfit to local lexical patterns and can show weak or negative OOD improvements (Section 3.2–3.3; Figure 4; Table 2).

  • How this paper positions itself

  • It proposes a framework to make on-policy learning feasible in black-box distillation by creating an implicit feedback mechanism: a discriminator trained to separate teacher vs. student responses, used as a reward model (Section 1; Section 2.1–2.3; Figure 2).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a two-model training setup: a student LLM generates answers, and a discriminator model scores prompt+answer pairs.
  • It solves black-box distillation by turning “match the teacher” into an adversarial game where the student is updated on its own generated outputs using reinforcement learning rewards produced by the discriminator (Section 2.1–2.3; Algorithm 1).

3.2 Big-picture architecture (diagram in words)

  • Inputs: prompts x from an instruction/chat dataset and teacher responses y_t collected once from the black-box teacher (Section 2.1; Section 3.1 “Dataset”).
  • Generator G (student LLM): produces a response G(x) (Section 2.1; Figure 2).
  • Discriminator D: takes concatenated [x, y] and outputs a scalar score D([x, y]) (Section 2.1, footnote 2).
  • Training loop:
  • Update D to score teacher responses higher than student responses using a pairwise preference loss (Bradley–Terry) (Eq. (3)).
  • Update G to increase the discriminator’s score on its own sampled outputs (Eq. (2)), using policy-gradient RL because sampling is non-differentiable (Section 2.2; Appendix A.1).

3.3 Roadmap for the deep dive

  • Explain the objective and why it is a minimax game (Eq. (1)).
  • Detail the discriminator design and its Bradley–Terry preference learning (Section 2.1; Eq. (3)).
  • Detail the generator optimization via RL and the GRPO implementation (Section 2.2–2.3; Appendix A.1; Eq. (5)–(8)).
  • Describe the training procedure, including the warmup stage and why it matters (Section 2.2; Algorithm 1; Table 3).
  • Provide a worked micro-example to make the preference/reward mechanics concrete (based on the toy experiment description, Section 3.3; Figure 5).

3.4 Detailed, sentence-based technical breakdown

This is an algorithmic / training-framework paper with empirical validation; the core idea is to enable black-box on-policy distillation by coupling a student generator with a co-trained discriminator that provides an adaptive reward signal (Section 2; Figure 2).

System/data pipeline diagram in words (explicit sequence)
1. First, a distillation dataset T = {(x, y_t)} is constructed by iterating over prompts x in an instruction dataset and sampling a teacher response y_t for each prompt using the black-box teacher (Section 2.1).
2. Second, for each training batch, the student generator G samples its own response(s) G(x) to each prompt x (Section 2.1–2.2; Algorithm 1).
3. Third, the discriminator D receives both (x, y_t) and (x, G(x)) (implemented as concatenated [x, y]) and produces scalar scores for each prompt-response pair (Section 2.1 and footnote 2).
4. Fourth, D is updated to assign a higher score to the teacher response than to the student response using the Bradley–Terry pairwise loss (Eq. (3); Appendix A.1 Eq. (8)).
5. Fifth, G is updated with reinforcement learning to maximize the discriminator score of its own sampled responses, treating D(G(x)) as a reward (Eq. (2); Section 2.2–2.3).
6. This alternating process repeats until convergence, with an initial warmup stage before adversarial/RL training begins (Algorithm 1; Section 2.2).
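The six steps above can be sketched on a toy problem. The following is a minimal illustration, not the paper's implementation: responses are reduced to class indices (in the spirit of the toy setup in Section 3.3), the discriminator is a hypothetical lookup table of scores, the student is a softmax over logits, and the generator update is plain REINFORCE with group-normalized rewards rather than full GRPO. The warmup stage is omitted, and all names, learning rates, and step counts are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_toy_gad(teacher_probs, steps=200, group_size=8,
                  lr_d=0.1, lr_g=0.1, seed=0):
    """Toy generator/discriminator loop (assumed setup, no warmup stage)."""
    rng = random.Random(seed)
    classes = list(range(len(teacher_probs)))
    theta = [0.0] * len(classes)  # student logits (generator G)
    d = [0.0] * len(classes)      # tabular discriminator scores D(y)

    for _ in range(steps):
        # Steps 1-2: one teacher sample and a group of student samples.
        y_t = rng.choices(classes, weights=teacher_probs)[0]
        probs = softmax(theta)
        ys = rng.choices(classes, weights=probs, k=group_size)

        # Step 3: score the student samples before D is updated.
        rewards = [d[y] for y in ys]

        # Step 4: sequential SGD on -log sigmoid(D(y_t) - D(y_s)) per pair.
        for y_s in ys:
            g = 1.0 - sigmoid(d[y_t] - d[y_s])
            d[y_t] += lr_d * g / group_size
            d[y_s] -= lr_d * g / group_size

        # Step 5: REINFORCE update with group-normalized rewards.
        mean = sum(rewards) / group_size
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
        advantages = [(r - mean) / (std + 1e-6) for r in rewards]
        for y_s, adv in zip(ys, advantages):
            for c in classes:
                grad_logp = (1.0 if c == y_s else 0.0) - probs[c]
                theta[c] += lr_g * adv * grad_logp / group_size

    return softmax(theta), d
```

Calling `train_toy_gad([0.5, 0.5] + [0.0] * 8)` returns the student's final class probabilities and the discriminator's score table. Because every discriminator update adds and subtracts the same amount, the table's scores always sum to zero in this sketch; only score differences carry information, which matches the pairwise nature of the loss.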

Core objective: minimax preference game (what it means and why it helps)
  • The paper defines a value function that sets up a two-player minimax game (Eq. (1)):
    • D is trained to increase the score gap D(y_t) - D(G(x)) so it can distinguish teacher outputs from student outputs.
    • G is trained to make D(G(x)) large so the discriminator cannot reliably separate student from teacher.
    • The loss uses -log σ(D(y_t) - D(G(x))), which is a Bradley–Terry style formulation for pairwise preferences (Section 2.1; Eq. (1), Eq. (3)).
  • In plain language: the discriminator learns a scoring function such that “teacher beats student,” and the student learns to produce outputs that “look teacher-like” under that scoring function.
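Written out, the game described above takes roughly the following form. This is a reconstruction from the prose description rather than a verbatim copy of Eq. (1), so the paper's notation may differ; here \mathcal{T} denotes the distillation dataset of (x, y_t) pairs and \sigma is the logistic sigmoid.

```latex
\max_{G}\,\min_{D}\; V(G, D)
  = \mathbb{E}_{(x,\, y_t) \sim \mathcal{T}}
    \Big[ -\log \sigma\big( D(y_t) - D(G(x)) \big) \Big]
```

Minimizing over D widens the gap D(y_t) - D(G(x)), while maximizing over G shrinks it by raising D(G(x)), which matches the two roles listed above.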

Discriminator D: architecture and loss mechanics
  • D is initialized from the generator’s model parameters with an added scalar prediction head (Section 2.1).
  • The head projects the final hidden state to a scalar; the score of the last token is used as a sequence-level score (Section 2.1).
  • Important missing detail: the excerpt does not specify the exact discriminator head architecture beyond “extra prediction head” nor the exact base model architecture hyperparameters (layers/hidden size/heads), so those cannot be enumerated here.
  • D is optimized with the Bradley–Terry loss that encourages D(y_t) > D(G(x)) (Eq. (3)). The paper later shows this choice matters vs. a cross-entropy discriminator (Table 4; Eq. (4)).

Generator G: why RL is needed and how it is done
  • Because G(x) is produced by sampling tokens, the mapping from parameters to the sampled text is non-differentiable; therefore, the paper treats D(G(x)) as a reward and uses policy-gradient RL (Section 2.2).
  • The paper uses GRPO for policy optimization (Section 2.2–2.3; Appendix A.1). In Appendix A.1:
    • For each prompt x, sample a group of N student responses {y_s^i} (Appendix A.1).
    • Compute reward per sample: r_s^i = D(y_s^i) (Eq. (5)).
    • Compute a normalized advantage per sample using group mean and standard deviation (Eq. (6)). In plain language: responses are judged relative to other sampled responses for the same prompt, stabilizing learning by centering/scaling rewards within the group.
    • Optimize the policy objective (Eq. (7)); the text notes they omit the KL regularizer and clip operator “for brevity” when writing Eq. (7), so the exact implemented GRPO objective includes additional elements not fully shown in the main body (Appendix A.1).
    • The discriminator training pairs each student response in the group with the same teacher response to form preference pairs (y_t, y_s^i) and averages the Bradley–Terry loss across the group (Eq. (8)).

Warmup stage (why it exists and what it does)
  • Before adversarial/RL training, the paper performs a 1-epoch warmup where:
    • G is fine-tuned on teacher responses with cross-entropy (teacher-forcing SFT) (Section 2.2; Algorithm 1).
    • D is also trained using the Bradley–Terry loss on the same data (Section 2.2; Algorithm 1).
  • The paper claims this warmup is “crucial” for balance and stability, and provides an ablation (Table 3; Section 3.4) showing drops when removing generator warmup or discriminator warmup.

Interpreting GAD as on-policy reward modeling (RLHF analogy)
  • The paper maps components to RL terms (Table 1; Section 2.3):
    • Policy model = generator/student G.
    • Reward model = discriminator D.
    • Reward = D(G(x)) (Eq. (2)).
  • The key difference vs. “conventional” setups in their framing is that the reward model is not frozen; it co-evolves with the student, which the paper argues reduces reward hacking risk (Section 2.3; Figure 6 in Section 3.3).

Worked micro-example (concrete walk-through of the Bradley–Terry scoring logic)
This example follows the toy setup described in Section 3.3 (Figure 5), but focuses specifically on how the pairwise discriminator loss creates training signals.

  • Suppose a prompt x produces a teacher answer y_t and the student samples two candidate answers y_s^1, y_s^2 (in the toy case these are “classes” 0–9, but the logic is identical for text).
  • The discriminator outputs scalar scores D(y_t), D(y_s^1), D(y_s^2) for the concatenated prompt+response.
  • For each pair (y_t, y_s^i), the discriminator loss term is:
  • L_i = -log σ(D(y_t) - D(y_s^i)) (Eq. (3), Appendix A.1 Eq. (8)).
  • If D(y_t) is already much larger than D(y_s^i), then σ(D(y_t)-D(y_s^i)) ≈ 1 and L_i is small (discriminator is succeeding).
  • If D(y_s^i) becomes larger than D(y_t), then the difference is negative, σ(·) is < 0.5, and the loss becomes large, pushing the discriminator to correct by increasing D(y_t) and/or decreasing D(y_s^i).
  • The generator update raises the probability of sampled responses that the discriminator scores highly, which increases the expected D(y_s^i) (Eq. (2)); it is implemented with policy gradients (Section 2.2; Appendix A.1).
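To put numbers on the walk-through above, the pairwise loss can be evaluated for a few score gaps. This is a small sketch with hypothetical function names; it uses the identity -log σ(Δ) = log(1 + exp(-Δ)) in a numerically stable form, and averages over the group as described for Eq. (8).

```python
import math


def bradley_terry_loss(score_teacher, score_student):
    """-log sigmoid(D(y_t) - D(y_s)), computed as a stable softplus."""
    delta = score_teacher - score_student
    return max(-delta, 0.0) + math.log1p(math.exp(-abs(delta)))


def group_bradley_terry_loss(score_teacher, student_scores):
    """Average the pairwise loss over all student samples for one prompt."""
    losses = [bradley_terry_loss(score_teacher, s) for s in student_scores]
    return sum(losses) / len(losses)


# Teacher far ahead: loss is small (about 0.018).
loss_teacher_ahead = bradley_terry_loss(4.0, 0.0)
# Tie: loss equals log(2), about 0.693.
loss_tie = bradley_terry_loss(0.0, 0.0)
# Student ahead: loss is large (about 4.018).
loss_student_ahead = bradley_terry_loss(0.0, 4.0)
```

The loss is asymmetric by design: it is nearly flat once the teacher is well ahead, and it grows roughly linearly in the gap when the student is ahead, which is what pushes the discriminator to correct.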

Core configurations / hyperparameters reported (what is specified vs. missing)
What is explicitly provided in the excerpt:

  • Data / training length
  • Training prompts: 200K sampled from LMSYS-Chat-1M-Clean (Section 3.1).
  • Training duration: 3 epochs total; for GAD specifically, 1 warmup epoch + 2 GAD epochs (Appendix A.2; Algorithm 1).
  • Optimization steps: ~2400 steps (Section 3.1; Appendix A.2).
  • Batch size: 256 (Section 3.1; Appendix A.2).
  • PPO mini-batch size: 256 (Section 3.1; Appendix A.2).
  • Checkpoints saved every 50 steps (Section 3.1).

  • Sequence lengths / sampling

  • Max context length: 2048 tokens for instruction prompts and 1536 tokens for responses (Section 3.1; Appendix A.2).
  • Training and sampling temperature: 0.8 (Section 3.1; Appendix A.2; Appendix A.3).

  • RL (GRPO) specific

  • Group size N = 8 (Appendix A.2; see also Appendix A.1).
  • KL weight β = 0.001 (Appendix A.2).
  • Note: The paper states Eq. (7) omits KL regularizer and clip operator “for brevity,” so the complete GRPO implementation details are not fully specified in the excerpt (Appendix A.1).

  • Learning rates

  • LR search range: [1e-6, 5e-6] for both GAD and SeqKD (Appendix A.2).
  • SeqKD uses 5e-6 in experiments (Appendix A.2).
  • For GAD with GPT-5-Chat teacher: 1e-6 for warmup and GAD stages (Appendix A.2).
  • For GAD with Qwen2.5 teacher (Table 7 setting): 5e-6 warmup and 1e-6 GAD stage (Appendix A.2).

  • Compute / hardware (one concrete datapoint)

  • Distilling Qwen2.5-14B-Instruct from GPT-5-Chat takes ~30 hours on 16 H100 GPUs (Appendix A.2).
  • The excerpt does not provide FLOPs/PF-days, total training tokens, or throughput/latency.

What is not provided (cannot be fabricated):

  • Student/teacher architectural hyperparameters such as number of layers, hidden dimension, attention heads, tokenizer details, optimizer type (AdamW/etc.), weight decay, LR schedule, gradient clipping, etc. These are not specified in the provided text, beyond “More training details can be found in Appendix A.2,” where the excerpt still does not include those items.
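The reported settings can be collected in one place and sanity-checked with a little arithmetic. The container below is a hypothetical convenience structure, not something from the paper; its field names are invented for illustration, and the values are the ones listed above for the GPT-5-Chat teacher setting. The step estimate assumes one optimizer step per batch of 256 prompts and one pass over all 200K prompts per epoch, which the excerpt does not state explicitly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GADConfig:
    """Reported settings; field names are illustrative, not the paper's."""
    num_prompts: int = 200_000
    epochs_total: int = 3
    warmup_epochs: int = 1
    batch_size: int = 256
    ppo_mini_batch_size: int = 256
    save_every_steps: int = 50
    max_prompt_tokens: int = 2048
    max_response_tokens: int = 1536
    temperature: float = 0.8
    grpo_group_size: int = 8
    kl_weight: float = 0.001
    lr_warmup: float = 1e-6  # GPT-5-Chat teacher setting
    lr_gad: float = 1e-6     # GPT-5-Chat teacher setting

    def steps_per_epoch(self) -> int:
        # Assumes one optimizer step per batch of prompts.
        return math.ceil(self.num_prompts / self.batch_size)

    def total_steps(self) -> int:
        return self.epochs_total * self.steps_per_epoch()


def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Simple product of wall-clock time and device count."""
    return wall_clock_hours * num_gpus


cfg = GADConfig()
estimated_steps = cfg.total_steps()     # 3 * ceil(200000 / 256) = 2346
reported_14b_cost = gpu_hours(30, 16)   # 480 GPU-hours on H100s
```

Under these assumptions the estimate comes to 2346 steps, which is in line with the roughly 2400 steps reported, and the 14B run corresponds to about 480 H100 GPU-hours.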


4. Key Insights and Innovations

  • (1) Adversarial formulation for black-box on-policy distillation
  • Innovation: framing black-box distillation as a generator–discriminator minimax game where the discriminator distinguishes teacher vs. student outputs and the student learns to fool it (Section 2.1; Figure 2; Eq. (1)).
  • Why it matters: it creates a usable learning signal for student self-generated outputs without any teacher logits, enabling on-policy updates in a setting where KL-based objectives are unavailable (Section 1; Section 2.1–2.2).

  • (2) Discriminator as an on-policy, co-evolving reward model

  • Innovation: interpreting D as an on-policy reward model that is updated online together with the student (Table 1; Section 2.3).
  • Significance: the paper empirically argues this avoids instabilities associated with frozen/off-policy reward models, showing reward hacking in an off-policy discriminator baseline but not in GAD (Section 3.3; Figure 6).

  • (3) Bradley–Terry pairwise loss for discriminator training (vs. standard GAN classifier loss)

  • Innovation: using a Bradley–Terry style preference loss -log σ(D(y_t)-D(y_s)) rather than a binary cross-entropy discriminator (Eq. (3) vs. Eq. (4); Table 4).
  • Significance: the ablation indicates Bradley–Terry improves automatic evaluation scores and stability in their setting (Table 4; Section 3.4).

  • (4) Warmup strategy to balance generator and discriminator

  • Innovation (practical): a coupled warmup where the student is initialized via cross-entropy on teacher outputs and the discriminator is trained before full adversarial training (Section 2.2; Algorithm 1).
  • Significance: ablations show removing generator warmup or discriminator warmup reduces performance (Table 3; Section 3.4), supporting the paper’s claim that balance is important for effective adversarial optimization.

5. Experimental Analysis

Evaluation methodology (datasets, metrics, baselines, setup)

  • Training data
  • Prompts: 200K samples from LMSYS-Chat-1M-Clean with teacher responses from GPT-5-Chat (Section 3.1).
  • Teacher: GPT-5-Chat (Section 3.1).
  • Students: Qwen2.5-{3B,7B,14B}-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct (Section 3.1).

  • Baselines

  • “Before Distill.”: the instruction-tuned student model as-is (Table 2).
  • SeqKD: supervised fine-tuning on teacher responses (Section 3.2; Table 2; Figure 1).

  • Test sets

  • In-domain: 500 held-out samples from LMSYS-Chat-1M-Clean (Section 3.1).
  • Out-of-distribution (OOD): 500 Dolly subset, 252 SelfInst, and 80-question Vicuna benchmark (Section 3.1).

  • Metric

  • Automatic: GPT-4o judge scores, where GPT-4o both generates the reference answers and scores the model outputs (Section 3.1; Appendix A.3; Figure 8 prompt).
  • Human: pairwise preference labeling (win/tie/lose) on LMSYS-Chat test set for selected models (Section 3.2; Figure 3).

Main quantitative results (with specific numbers)

Key results are summarized in Table 2 (and extended with response lengths in Table 6). Selected highlights:

  • Qwen2.5-14B-Instruct
  • LMSYS: Before 50.0, SeqKD 50.6, GAD 52.1 (Table 2).
  • Dolly: Before 49.1, SeqKD 48.2, GAD 50.4 (Table 2).
  • SelfInst: Before 49.4, SeqKD 49.4, GAD 51.1 (Table 2).
  • Vicuna: Before 50.0, SeqKD 49.7, GAD 51.6 (Table 2).
  • Teacher GPT-5-Chat on LMSYS is 51.7, so GAD’s 52.1 is reported as “comparable” in the paper’s discussion (Table 2; Figure 1 narrative).

  • Qwen2.5-7B-Instruct

  • LMSYS: Before 48.7, SeqKD 49.2, GAD 50.8 (Table 2).
  • OOD gains are also positive for GAD: Dolly 48.5 vs 47.2 SeqKD; SelfInst 50.1 vs 48.3 SeqKD; Vicuna 51.4 vs 49.5 SeqKD (Table 2).

  • Qwen2.5-3B-Instruct

  • LMSYS: Before 45.8, SeqKD 47.5, GAD 48.9 (Table 2).
  • Dolly: 46.7 vs 44.8 SeqKD (Table 2).
  • SelfInst: 47.7 vs 45.7 SeqKD (Table 2).
  • Vicuna: 49.4 vs 48.0 SeqKD (Table 2).

  • Llama students

  • Llama-3.1-8B-Instruct LMSYS: Before 46.9, SeqKD 49.7, GAD 50.3 (Table 2).
  • Llama-3.2-3B-Instruct LMSYS: Before 44.0, SeqKD 47.6, GAD 48.1 (Table 2).
  • OOD: GAD generally improves over SeqKD (e.g., Llama-3.2-3B on Dolly 48.5 vs 47.0; SelfInst 49.1 vs 47.1; Vicuna 48.9 vs 48.1) (Table 2).

You can see the same pattern visually in Figure 1, including the paper’s claim that, for LMSYS test:

  • Qwen2.5-3B + GAD matches Qwen2.5-7B + SeqKD, and
  • Qwen2.5-14B + GAD is comparable to GPT-5-Chat (Figure 1; Table 2).

Human evaluation

  • Figure 3 reports win/tie/lose percentages for GAD vs (a) before distillation and (b) SeqKD, on three models:
  • Qwen2.5-7B-Instruct: vs before distill = 68% win / 4% tie / 28% lose; vs SeqKD = 52% win / 40% tie / 8% lose (Figure 3).
  • Qwen2.5-14B-Instruct: vs before distill = 56% win / 28% tie / 16% lose; vs SeqKD = 68% win / 8% tie / 24% lose (Figure 3).
  • Llama-3.1-8B-Instruct: vs before distill = 60% win / 28% tie / 12% lose; vs SeqKD = 44% win / 40% tie / 16% lose (Figure 3).
  • In all six reported comparisons GAD wins more often than it loses (Section 3.2; Figure 3), though the excerpt does not specify annotator count or inter-annotator agreement.

Analyses and ablations (do they support the mechanism?)

  • SeqKD overfits to local patterns
  • The paper measures n-gram overlap F1 between student and teacher; SeqKD shows higher overlap but lower GPT-4o score than GAD (Section 3.3; Figure 4). This is used to argue SFT memorizes lexical patterns while GAD captures “global stylistic characteristics.”

  • Toy mode behavior

  • The toy experiment claims SeqKD is “mode-covering” while GAD is “mode-seeking,” learning reachable modes (Section 3.3; Figure 5). This is presented as an intuition for why on-policy RL helps distillation.

  • On-policy vs off-policy discriminator

  • Figure 6 shows an “off-policy discriminator” setup leading to reward hacking: response lengths explode up to ~1300 tokens after ~300 steps, while GAD remains stable (Section 3.3; Figure 6). This supports the claim that the co-evolving discriminator stabilizes training.

  • Warmup ablation

  • Table 3 (on Qwen2.5-7B-Instruct) shows:
    • GAD: LMSYS 50.8, Others 50.0.
    • w/o generator warmup: LMSYS 49.7, Others 49.7.
    • w/o discriminator warmup: LMSYS 49.0, Others 47.7.
  • This supports the claim in Section 2.2 that warmup is important.

  • Discriminator loss ablation

  • Table 4 (on Qwen2.5-3B-Instruct) shows Bradley–Terry discriminator outperforms CE discriminator:
    • GAD (BT loss): LMSYS 48.9, Others 47.9.
    • Disc. CE loss: LMSYS 47.9, Others 46.4.
  • This supports choosing Eq. (3) over Eq. (4).

  • Discriminator size ablation

  • Table 5 shows equal generator/discriminator sizes perform best; increasing discriminator size does not help (e.g., 3B gen + 7B disc drops vs 3B+3B) (Section 3.4; Table 5). This supports a “balance” hypothesis.
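The n-gram overlap F1 used in the SeqKD analysis above can be approximated with a short helper. This is a generic sketch, not the paper's metric code: the excerpt does not say which n, tokenizer, or counting scheme was used, so whitespace tokenization, clipped multiset overlap, and the default n=2 are all assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list (empty if the list is too short)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(student_text, teacher_text, n=2):
    """F1 of clipped n-gram overlap between a student and a teacher response."""
    student = ngram_counts(student_text.split(), n)
    teacher = ngram_counts(teacher_text.split(), n)
    overlap = sum((student & teacher).values())  # clipped shared count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(student.values())
    recall = overlap / sum(teacher.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, `ngram_f1("a b c", "a b d")` shares one of two bigrams on each side, giving precision 0.5, recall 0.5, and F1 0.5. A high value of this kind of score indicates surface-level lexical agreement with the teacher, which is the property the paper contrasts with the GPT-4o judge score.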

Are the experiments convincing?

  • Strengths:
  • Multiple student families and sizes (Qwen2.5 and Llama) (Table 2).
  • Both in-domain and OOD test sets (Section 3.1; Table 2).
  • Mechanism-supporting analyses: n-gram overlap (Figure 4), toy modes (Figure 5), off-policy reward hacking contrast (Figure 6), and several ablations (Tables 3–5).

  • Caveats (based strictly on what is provided):

  • The primary automatic metric is GPT-4o judging (Section 3.1; Appendix A.3). The excerpt does not report classical task metrics or multiple independent judge models, so robustness to judge choice is unclear from the provided text.
  • The excerpt does not specify variance/error bars, multiple random seeds, or statistical tests for Table 2 scores.

6. Limitations and Trade-offs

  • Dependence on teacher-generated data and teacher querying costs
  • Although black-box, the approach still requires collecting teacher responses for many prompts (Section 2.1; Section 3.1 uses 200K prompts), which can be costly with an API teacher. The excerpt does not quantify API cost.

  • Training complexity and stability management

  • GAD requires training two models (generator + discriminator) and running RL updates; it is more complex than SeqKD SFT (Algorithm 1; Section 2.2–2.3).
  • The paper explicitly needs a warmup stage to stabilize training (Section 2.2; Table 3), implying sensitivity to initialization/balance.

  • Potential reward hacking remains a general concern

  • The paper shows off-policy reward hacking in Figure 6 and argues on-policy co-evolution helps, but it does not claim reward hacking is impossible; it shows an empirical contrast for their setup (Section 3.3; Figure 6).

  • Evaluation limitations

  • Automatic evaluation relies heavily on GPT-4o as judge (Section 3.1; Appendix A.3), which may introduce systematic biases. The excerpt does not provide alternative automatic metrics to triangulate results.

  • Missing reproducibility details (in the provided excerpt)

  • Key training details are not included here: optimizer type, LR schedule, gradient clipping, exact GRPO clipping/KL formulation (the paper says it omits some parts in Eq. (7)), model architecture parameters, tokenizer settings, and total tokens processed (Appendix A.1–A.2 mention omissions and do not enumerate these items in the excerpt).

  • Scope of tasks

  • Evaluations are on chat/instruction datasets (LMSYS, Dolly, SelfInst, Vicuna) (Section 3.1). The excerpt does not show domain-specific tasks (e.g., coding, math) or safety/alignment evaluations, so generality beyond these instruction-following benchmarks is not established here.

7. Implications and Future Directions

  • How this changes the landscape (within the paper’s scope)
  • GAD suggests a concrete way to do on-policy distillation without logits, which directly targets a key obstacle in black-box distillation (Section 1; Section 2.1–2.3).
  • If the reported gains generalize, it positions adversarial + RL-style training as a stronger alternative to pure SFT-based SeqKD for black-box settings, especially for OOD generalization (Section 3.2; Figure 1; Table 2).

  • Follow-up research suggested by the results

  • Reward model/discriminator design: The discriminator is initialized from the generator with an added head (Section 2.1), and discriminator size balancing matters (Table 5). Future work could explore alternative scoring architectures or regularizers, but the excerpt itself only tests CE vs Bradley–Terry (Table 4) and size scaling (Table 5).
  • Stability and anti-hacking mechanisms: Figure 6 motivates more systematic study of when on-policy co-evolution prevents reward hacking and when it fails (Section 3.3).
  • Broader evaluations: Given reliance on GPT-4o judging, future work could validate with additional judges/metrics and more task diversity; this is not done in the provided excerpt but is a natural next step based on the evaluation setup (Section 3.1; Appendix A.3).

  • Practical applications / downstream use cases

  • Distilling an API teacher into an open-weight student when logits are unavailable (Section 1; Section 3.1).
  • Distillation across tokenizer-incompatible model families (Appendix B.1 “Qwen2.5 Teacher”; Table 7), where logit-level distillation is difficult.

  • Repro/Integration Guidance: when to prefer this over alternatives (based on the paper)

  • Prefer SeqKD (SFT on teacher responses) when you want a simpler pipeline and can accept weaker OOD gains; the paper shows SeqKD can be marginal/negative on OOD sets while GAD improves consistently (Section 3.2; Table 2).
  • Prefer GAD when:
    • The teacher is black-box (text-only), and
    • You want on-policy improvement and better generalization, and
    • You can afford RL training complexity (Algorithm 1; compute note in Appendix A.2).
  • Implementation notes grounded in the excerpt:
    • Use the warmup stage (Section 2.2; Table 3).
    • Use Bradley–Terry discriminator loss rather than CE discriminator loss (Table 4; Eq. (3) vs Eq. (4)).
    • Keep generator and discriminator sizes balanced (Table 5).
    • Use the reported training configuration as a starting point: batch size 256, temperature 0.8, context lengths 2048/1536, GRPO group size 8, KL weight 0.001, and learning rate around 1e-6 for GAD with the GPT-5-Chat teacher in their experiments (Section 3.1; Appendix A.2).