
Draft with Diffusion, Verify with Autoregressive Models

ArXiv: 2512.15176

🎯 Pitch

The paper introduces DEER, a lossless speculative decoding framework that drafts entire token blocks using a discrete diffusion LLM (dLLM) and then verifies them with a standard autoregressive (AR) model, replacing sequential AR drafters with one-step parallel generation. This removes left-to-right error accumulation and enables much longer accepted draft blocks, yielding large practical speedups in inference (e.g., 5.54× on HumanEval with Qwen3-30B-A3B) while preserving the target model’s exact output distribution.


1. Executive Summary

DEER is a speculative decoding framework that drafts token blocks using a discrete diffusion language model (dLLM) and then verifies them with a standard autoregressive (AR) target model, aiming to reduce inference latency without changing the target model’s output distribution. The key practical significance is that diffusion drafting can generate blocks in parallel and avoid the left-to-right error accumulation that limits acceptance lengths in AR drafters, yielding substantially larger speedups (e.g., 5.54× on HumanEval with Qwen3-30B-A3B at temperature 0; Table 2).

2. Context and Motivation

  • Problem / gap addressed
  • Autoregressive decoding generates tokens strictly left-to-right, which is inherently sequential and therefore latency/throughput limited for LLM-based “agentic” and reasoning-style systems (Introduction).
  • Speculative decoding reduces this cost via a draft–verify workflow: a lightweight “drafter” proposes several next tokens, and the expensive target model verifies (and corrects) them.

  • Why it matters

  • The bottleneck is practical: inference speed constrains interactive systems and long-context workloads (Introduction).
  • A “lossless” acceleration method is especially valuable: it accelerates while preserving the exact distribution of the target AR sampler (Section 4.1 “Metrics”; Appendix E).

  • Prior approaches and shortfalls (as framed here)

  • Most existing speculative decoding methods rely on AR drafters (Related Work §2.1).
  • The paper identifies two structural limitations of AR drafters (Introduction; Section 3):
    1. Step-wise uncertainty accumulation (“gradual collapse of trust”): errors in early draft tokens propagate because each next drafted token conditions on previously drafted (unverified) tokens.
    2. Sequential drafting: AR drafters still decode left-to-right, limiting parallelism.
  • Tree/head-based AR drafting (e.g., Medusa/Hydra/EAGLE) can help but remains tied to sequential dependencies during drafting (Related Work §2.1).

  • How this paper positions itself

  • It proposes using a discrete diffusion LLM as the sole drafter to (i) draft blocks in parallel and (ii) reduce uncertainty accumulation by making within-block proposals independent of earlier draft tokens (Section 3.2).
  • A major obstacle is distribution mismatch: pretrained diffusion LMs are trained for global denoising, not prefix-conditioned continuation; the paper’s response is a dedicated Diffusion-to-AR alignment training pipeline (Section 3.1).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a speculative decoding stack where a small diffusion model proposes a block of next tokens and a large AR model verifies them.
  • It solves the AR decoding latency problem by changing how candidate tokens are proposed (parallel diffusion drafting) while keeping correctness through AR verification with rejection/resampling.

3.2 Big-picture architecture (diagram in words)

  • Training (D2A Alignment):
  • Start from a pretrained dLLM (for code: an open-dCoder-derived diffusion model; Section 4.1 + Appendix A).
  • Stage I: AR-style continuation distillation to make the dLLM behave like “predict the continuation after a prefix + [SEP]”.
  • Stage II: Scribe refinement to improve accuracy near the prefix boundary using weighted suffix masking.

  • Inference (DEER decoding loop):

  • Given prefix x1:j, the dLLM proposes a length-k block ŷj+1:j+k in parallel.
  • The target AR model verifies token-by-token using an acceptance probability ratio (Eq. (5)).
  • If a token is rejected, it is replaced by a corrected sample via a residual construction (Eq. (6)).
  • Repeat until EOS/length limit (Algorithm 1).

3.3 Roadmap for the deep dive

  • I first explain why AR drafters fail structurally (uncertainty accumulation and sequentiality).
  • Then I describe the two-stage training pipeline that makes diffusion drafting usable for prefix continuations (Stage I and II).
  • Next I walk through DEER inference step-by-step, including the acceptance rule and why it is lossless.
  • Finally I connect the mechanism to the reported metrics: acceptance length τ and speedup.

3.4 Detailed, sentence-based technical breakdown

  • Framing (type of paper + core idea).
  • This is primarily an empirical systems/algorithm paper: it designs a speculative decoding framework and training procedure so that a discrete diffusion model can act as an effective drafter for AR verification, with experiments showing higher throughput and longer accepted blocks (Sections 3–4).

  • Key definitions (paper-specific or potentially unfamiliar).

  • Speculative decoding is a two-model sampling procedure that accelerates generation by drafting multiple tokens from a proposal model and then verifying them against a target model; rejection/resampling ensures the final sample follows the target distribution (Algorithm 1; Appendix E).
  • Acceptance length (τ) is the average number of drafted tokens accepted per draft–verify cycle, which correlates with how many tokens the system can effectively produce “per expensive verification” (Section 4.1; Tables 1–2).
  • A discrete diffusion language model (dLLM) generates text by predicting masked tokens via denoising in token space (using a mask token M), rather than predicting the next token causally left-to-right (Section 3.1–3.2).

  • System/data pipeline diagram in words (what happens first, second, third).

Training pipeline (Section 3.1; Figure 3; Appendix A):

  1. First, prepare continuation-style training targets from an AR teacher.
    • The AR teacher is the target model distribution p_AR that the drafter should align to (notation in Section 3.1).
    • A teacher-generated answer A of length L is randomly truncated, and a special separator token [SEP] is appended to mark the boundary between the known prefix and the to-be-generated continuation (Section 3.1.1; Figure 3).
  2. Second, run Stage I (AR-style continuation distillation).
    • The dLLM observes a noised sequence x_t and is trained to reconstruct (denoise) only the masked continuation tokens, using the loss L_Distill (Eq. (1)–(2)).
    • Concretely, the training loss sums token log-probabilities log p_θ(x_i^0 | x_t) only over positions currently masked (1[x_i^t = M]), scaled by 1/(tL) in Eq. (1).
    • The intent is to change the diffusion model from “globally denoise a full sequence” into “given a prefix up to [SEP], generate a plausible continuation,” reducing the distribution mismatch that otherwise makes drafts unstable (Section 3.1.1).
  3. Third, run Stage II (Scribe refinement).
    • Stage II masks only the last R tokens of the answer, with R ~ Uniform(1, 96) (Section 3.1.2).
    • It applies position-dependent weights w_i = α^(R−i) (Eq. (3)) in the refined objective L_Refine (Eq. (4)), to emphasize correctness near the prefix/verification boundary (Section 3.1.2).
    • The paper reports that this stage is sensitive to α and can diverge if α is set too aggressively (Figure 6).
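To make the Stage I objective concrete, here is a minimal sketch of the masked-token loss as it is described above: log-probabilities are summed only over positions that are currently masked and scaled by 1/(tL). The negative sign (so that lower is better), the function name `distill_loss`, the toy sequence, and the log-probability values are all assumptions for illustration; none of them come from the paper.

```python
import math

MASK = "[MASK]"  # stands in for the mask token M
SEP = "[SEP]"    # boundary between known prefix and continuation

def distill_loss(noised_tokens, log_probs, t, length):
    """Masked-token loss in the spirit of Eq. (1) as summarized above.

    noised_tokens : the noised sequence x_t (a list of tokens)
    log_probs     : log_probs[i] = log p_theta(x_i^0 | x_t), hypothetical values
    t             : masking ratio (noise level), 0 < t <= 1
    length        : the L used in the 1/(tL) scaling

    Assumption: the loss is the negative of the scaled sum, so lower is better.
    """
    masked_sum = sum(lp for tok, lp in zip(noised_tokens, log_probs) if tok == MASK)
    return -masked_sum / (t * length)

# Toy example: two of the four continuation tokens after [SEP] are masked.
noised = ["x", "=", SEP, MASK, "+", MASK, ";"]
log_probs = [0.0, 0.0, 0.0, math.log(0.5), 0.0, math.log(0.25), 0.0]
loss = distill_loss(noised, log_probs, t=0.5, length=4)
print(f"toy Stage I loss: {loss:.4f}")
```

Only the two masked positions contribute; the prefix and the unmasked continuation tokens are ignored, which is what pushes the drafter toward prefix-conditioned continuation.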

Inference pipeline (Section 3.2; Algorithm 1):

  1. Draft a block in parallel.
    • Given the current prefix x1:j, the diffusion drafter samples a block of k tokens: ŷj+1:j+k ~ q_θ(· | x1:j) (Section 3.2).
    • A crucial modeling property claimed here is within-block independence from prior drafted tokens: q_θ(ŷ_i | x1:j, ŷ1:i−1) = q_θ(ŷ_i | x1:j) (Section 3.2), unlike AR drafting.
  2. Verify token-by-token against the AR target.
    • For each token position i = 1..k inside the block, compute the acceptance probability α_i = min(1, p_AR(ŷj+i | x1:j+i−1) / q_θ(ŷj+i | x1:j)) (Eq. (5)).
    • Accept with probability α_i; otherwise reject and replace.
  3. If rejected, resample from the residual distribution.
    • The paper gives a residual-form update: the replacement for position j+i is drawn from a distribution proportional to max(0, p_AR(· | x1:j+i−1) − q_θ(· | x1:j)) (Eq. (6)).
    • Algorithm 1 states that when rejected, the token is sampled “via Eq. 6” and appended to the prefix.
  4. Append accepted/resampled tokens to the prefix and continue until EOS.
    • The context for later positions includes whichever token was chosen at each position (accepted draft or AR replacement), matching the standard speculative decoding flow (Algorithm 1).
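The verification steps above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: `verify_block`, `toy_p_ar`, and `TOY_Q` are hypothetical names, distributions are plain dictionaries, and the sketch ends the cycle at the first rejection, as standard speculative sampling does (the summary above does not spell out whether Algorithm 1 discards the remaining drafts at that point).

```python
import random
from collections import Counter

def sample(dist, rng):
    """Draw one token from a dict of token -> (possibly unnormalized) weight."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]

def verify_block(prefix, draft, q_dists, p_ar, rng):
    """Accept/reject a drafted block position by position.

    q_dists[i] is the drafter distribution for block position i, conditioned
    only on the prefix. p_ar(context) returns the target distribution given
    the tokens chosen so far (a stand-in for the AR model).
    """
    chosen = []
    for tok, q in zip(draft, q_dists):
        p = p_ar(tuple(prefix) + tuple(chosen))
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):  # Eq. (5)
            chosen.append(tok)
            continue
        # Rejected: draw from the residual max(0, p - q), as in Eq. (6).
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        chosen.append(sample(residual, rng))
        break  # assumption: stop the cycle at the first rejection
    return chosen

def toy_p_ar(context):
    """Hypothetical target distribution, independent of context for simplicity."""
    return {"a": 0.5, "b": 0.3, "c": 0.2}

TOY_Q = {"a": 0.2, "b": 0.5, "c": 0.3}  # hypothetical, deliberately mismatched drafter

def empirical_first_token(trials=20000, seed=0):
    """Frequency of the first emitted token over many draft/verify cycles."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        draft = [sample(TOY_Q, rng)]
        counts[verify_block([], draft, [TOY_Q], toy_p_ar, rng)[0]] += 1
    return {t: counts[t] / trials for t in TOY_Q}

if __name__ == "__main__":
    print(empirical_first_token())
```

Even though the toy drafter disagrees with the target, the emitted frequencies should track the target distribution rather than the drafter's, which is the practical meaning of "lossless" here.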

  • Worked micro-example (single input → output walk-through).
  • Suppose the current prefix is x1:j = "The capital", and block size k = 4.
  • The dLLM proposes in parallel: ŷj+1:j+4 = [" of", " France", " is", " Paris"].
  • Verification proceeds left-to-right inside the block:
    1. For token " of", compute α_1 = min(1, p_AR(" of" | "The capital") / q_θ(" of" | "The capital")) (Eq. (5)).
    2. If accepted, append " of" to the prefix.
    3. If rejected, sample a replacement token from the residual (Eq. (6)) and append that instead.
    4. For token " France", compute α_2 using the updated prefix x1:j+1 (now includes whatever was chosen at step 1), but the denominator remains q_θ(· | x1:j) as written in Eq. (5).
  • This continues until either all 4 tokens are processed or EOS appears (Algorithm 1).
  • The claimed efficiency gain is that the expensive model’s work is reduced when many tokens are accepted, and the diffusion drafter can propose the block with parallel computation.

  • Why diffusion drafting reduces “uncertainty accumulation” in this setup (mechanistic explanation).

  • With an AR drafter, the proposal for later tokens depends on earlier drafted tokens:
    • q_AR(ŷ_i | x1:j, ŷ1:i−1) ≠ q_AR(ŷ_i | x1:j) (Section 3.2).
  • If early draft tokens differ from the AR model’s preferred tokens, this mismatch compounds because the AR drafter continues drafting conditioned on its own (potentially wrong) outputs, shrinking acceptance rates deeper in the block (Introduction; Section 3.2).
  • The diffusion drafter is trained to propose tokens based on the prefix with masked conditioning, giving the paper’s key independence relation:
    • q_θ(ŷ_i | x1:j, ŷ1:i−1) = q_θ(ŷ_i | x1:j) (Section 3.2).
  • As presented, this removes the drafter’s internal left-to-right feedback loop, so later positions are less affected by earlier drafter mistakes.
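To see why acceptance rates "deeper in the block" matter so much, the following sketch computes the expected number of accepted draft tokens per cycle from per-position acceptance probabilities, assuming the cycle ends at the first rejection. The probability profiles are made-up numbers chosen only to contrast a flat profile with a decaying one; they are not measurements from the paper, and `expected_accepted` is a hypothetical helper.

```python
def expected_accepted(accept_probs):
    """Expected count of accepted drafts before the first rejection.

    accept_probs[i] is the probability that position i is accepted given that
    all earlier positions were accepted. The expectation is the sum over i of
    the probability that positions 0..i are all accepted.
    """
    expected, survive = 0.0, 1.0
    for a in accept_probs:
        survive *= a          # probability the run is still alive after position i
        expected += survive
    return expected

k = 8
flat = [0.8] * k                              # hypothetical: no decay with depth
decaying = [0.8 * 0.9 ** i for i in range(k)]  # hypothetical: trust erodes with depth

print(f"flat profile:     {expected_accepted(flat):.2f} tokens per cycle")
print(f"decaying profile: {expected_accepted(decaying):.2f} tokens per cycle")
```

Because the terms are products of all earlier acceptance probabilities, any decay with depth is compounded, so a drafter whose later positions stay as reliable as its first can sustain noticeably longer accepted runs.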

  • Correctness / “losslessness”.

  • The paper includes a formal proof that the algorithm is lossless, meaning the final output distribution matches direct sampling from the target AR model p_AR (Appendix E, Theorem E.2).
  • The proof is structured via a one-step lemma (Lemma E.1) for a single token position using an accept/reject rule with a residual distribution, then extended to the full sequence by induction (Appendix E.1–E.3).
  • A stated requirement is a support condition: if p_AR(x) > 0 then the proposal q_θ(x) > 0 (Appendix E.1 Eq. (7); Appendix E.2 notes the training “ensures” this, though the mechanism for guaranteeing strict support is not elaborated in the provided excerpt).
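For intuition, the single-position step can be checked by hand with the standard speculative-sampling argument. The derivation below is a sketch in that spirit and is not copied from Lemma E.1, which is not reproduced in the excerpt. Here p denotes the target distribution p_AR(· | x1:j+i−1), q denotes the drafter distribution q_θ(· | x1:j) at the same position, and Z is the normalizer of the residual in Eq. (6).

```latex
\begin{aligned}
\Pr[\text{output}=x]
  &= q(x)\,\min\!\Bigl(1,\tfrac{p(x)}{q(x)}\Bigr)
     + \Pr[\text{reject}]\cdot\frac{\max\bigl(0,\,p(x)-q(x)\bigr)}{Z},
  \qquad Z=\sum_{y}\max\bigl(0,\,p(y)-q(y)\bigr),\\[4pt]
\Pr[\text{reject}]
  &= 1-\sum_{y}\min\bigl(p(y),q(y)\bigr)
   = \sum_{y}\Bigl(p(y)-\min\bigl(p(y),q(y)\bigr)\Bigr)
   = Z,\\[4pt]
\Pr[\text{output}=x]
  &= \min\bigl(p(x),q(x)\bigr)+\max\bigl(0,\,p(x)-q(x)\bigr)
   = p(x).
\end{aligned}
```

The rejection probability cancels the normalizer Z exactly, so each position emits a token distributed according to p; chaining this over positions is what the induction step extends to full sequences.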

  • Core configurations / hyperparameters explicitly given (and what is missing).

  • Hardware: experiments run on 8× NVIDIA A100 GPUs (80 GB each) (Appendix A).
  • Code drafter model: a 0.5B-parameter discrete diffusion drafter derived by modifying open-dCoder / open-dLLM (Appendix A; Section 4.1 “Models”).
  • Math drafter model: a 0.5B-parameter drafter initialized from Qwen2.5-0.5B-Instruct and converted to a diffusion model with a diffusion decoding head (Section 4.1; Appendix A).
  • Optimization (code tasks):
    • Stage I: AdamW, learning rate 1e-4, 1 epoch over the code training corpus (Appendix A).
    • Stage II: AdamW, learning rate 5e-5, 1 epoch on 100k examples subset (Appendix A).
  • Optimization (math tasks):
    • Pretraining/conversion training: 40 epochs on UltraChat to obtain a partially converged dLLM (Section 4.6; Appendix A).
    • Stage I: AdamW, learning rate 1e-4, 5 epochs (Appendix A).
    • Stage II: AdamW, learning rate 1e-4, 1 epoch (Appendix A).
  • Stage II masking hyperparameters: R ~ Uniform(1, 96) and weight base α (Section 3.1.2; Figure 6).
  • Block size k: used in inference (Algorithm 1); sensitivity to block size is explored for k ∈ {4, 8, 16, 32} (Figure 8).
  • Missing (not provided in the excerpt):
    • Target model decoding settings beyond temperature (e.g., top-p/top-k).
    • Tokenizer details, context window length, and architecture specifics (layers/hidden size/heads) for the dLLM and targets.
    • Total training tokens, exact batch sizes, weight decay, and learning rate schedule.

4. Key Insights and Innovations

  • (1) Diffusion-as-drafter for speculative decoding with one-step block drafting
  • Novelty: DEER uses a discrete-space dLLM as the sole drafter (Contribution bullets; Section 3), rather than an AR drafter or hybrid.
  • Significance: This is positioned as addressing both AR-drafter bottlenecks simultaneously:

    • Parallel block generation (efficiency),
    • Reduced uncertainty accumulation via within-block proposal independence (acceptance length).
  • (2) Two-stage Diffusion-to-AR (D2A) alignment to fix distribution mismatch

  • Stage I (Section 3.1.1): “AR-style continuation distillation” transforms the dLLM’s training into prefix-conditioned continuation by truncating teacher answers and inserting [SEP].
  • Stage II (Section 3.1.2): “Scribe refinement” focuses accuracy near the verification boundary with weighted suffix masking and exponential weights.
  • Significance: Experiments show Stage II increases accepted-token length on harder benchmarks (Table 3), which directly affects speedup.

  • (3) Empirical acceptance-length gains to long blocks (up to 32 tokens)

  • DEER reaches maximum accepted lengths of 32 tokens across multiple target model scales (Table 4).
  • Compared baseline: EAGLE-3’s max accepted length is 7–8 tokens in the same table.

  • (4) Reported emergent behavior: “reliable block regeneration”

  • The paper claims the aligned dLLM can repeatedly regenerate partially masked suffix blocks coherently (“reliable block regeneration”; contribution bullets; Section 4.5; Figure 5).
  • This is presented as a qualitative capability enabled by the training pipeline rather than architectural changes (Section 4.5).

5. Experimental Analysis

  • Evaluation methodology (Section 4.1).
  • Code generation:
    • Training data: OpenCodeInstruct (Section 4.1).
    • Evaluation benchmarks: HumanEval, MBPP, LiveCodeBench, CodeAlpacaPy (Python subset) (Section 4.1).
  • Math reasoning:
    • Training data: UltraChat and ShareGPT (Section 4.1).
    • Evaluation benchmarks: GSM8K, Math500, Minerva Math (Section 4.1).
  • Baselines: Medusa, Hydra, EAGLE-3 (Section 4.1).
  • Metrics:
    • Speedup ratio: end-to-end speedup vs standard AR decoding.
    • Average acceptance length (τ): accepted tokens per speculative cycle (Section 4.1).
  • Temperatures reported:
    • Temperature 0.6 (Table 1; and math Table 6).
    • Temperature 0 (Table 2; Figure 1).
  • KV cache conditions:

    • Many code benchmark results are “with KV cache” (Tables 1–2; Section 4.2.1).
    • Batch inference study is explicitly without KV cache for both AR and DEER due to lack of mature dLLM KV-cache deployment (Section 4.4; Table 5; Appendix F).
  • Main quantitative results (code benchmarks).

Temperature = 0, with KV cache (Table 2):

  • Qwen3-30B-A3B on HumanEval:
    • DEER: 5.54× speedup, τ = 6.58
    • EAGLE-3: 2.41× speedup, τ = 3.21
    • This is the headline result, also echoed in the abstract/intro.
  • Mean across the four code benchmarks (MBPP, CodeAlpacaPy, HumanEval, LiveCodeBench):
    • Qwen3-30B-A3B: DEER mean 4.04×, τ = 5.03 vs EAGLE-3 mean 2.21×, τ = 3.05 (Table 2).
  • For smaller targets, DEER similarly improves mean speedup and τ:
    • Qwen3-14B: mean 2.98×, τ = 4.82 vs EAGLE-3 mean 2.39×, τ = 3.54 (Table 2).
    • Qwen3-8B: mean 2.83×, τ = 4.61 vs EAGLE-3 mean 2.48×, τ = 3.45 (Table 2).
    • Qwen3-4B: mean 2.77×, τ = 4.61 vs EAGLE-3 mean 2.40×, τ = 3.31 (Table 2).

Temperature = 0.6, with KV cache (Table 1):

  • For Qwen3-30B-A3B, mean DEER speedup 3.62×, τ = 4.45 vs EAGLE-3 mean 2.01×, τ = 2.40 (Table 1).
  • This shows DEER’s advantage persists under nonzero temperature, though absolute values differ.

  • Acceptance distribution and maximum accepted lengths.
  • Max accepted length is 32 tokens for DEER across Qwen3 targets; EAGLE-3 is 7–8 tokens (Table 4).
  • The fraction of “long accepts” (≥ 8 tokens) is reported around 15.8%–16.5% across Qwen3-4B/8B/14B in Figure 4.
  • Appendix B.1 / Figure 7 reports a “long-block resurgence effect”: counts roughly decay exponentially up to about 30 tokens, then increase near the maximum range (the explanation given is reduced exposure to left-to-right error propagation when later draft tokens are not conditioned on earlier draft tokens).

  • Ablation: impact of Stage II refinement.

  • On Qwen3-30B-A3B, Stage II increases average accepted length across all four code benchmarks (Table 3):
    • MBPP: 4.74 → 4.87
    • CodeAlpacaPy: 3.47 → 4.04
    • HumanEval: 5.38 → 6.58
    • LiveCodeBench: 3.87 → 5.03
  • The largest gains are on the harder benchmarks (HumanEval, LiveCodeBench), consistent with the claim that boundary accuracy is crucial for acceptance (Section 4.3.1).
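The relative size of these gains is easier to compare as percentages. The short script below only restates the Table 3 numbers quoted above and computes the relative change; `relative_gain` is a hypothetical helper introduced for illustration.

```python
# Average accepted length on Qwen3-30B-A3B before and after Stage II,
# as quoted above from Table 3: (without Stage II, with Stage II).
stage2_ablation = {
    "MBPP": (4.74, 4.87),
    "CodeAlpacaPy": (3.47, 4.04),
    "HumanEval": (5.38, 6.58),
    "LiveCodeBench": (3.87, 5.03),
}

def relative_gain(before, after):
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before

gains = {name: relative_gain(b, a) for name, (b, a) in stage2_ablation.items()}
for name, gain in sorted(gains.items(), key=lambda item: item[1]):
    print(f"{name:14s} {gain:+.1%}")
```

Sorted this way, MBPP gains the least and LiveCodeBench the most in relative terms, with HumanEval showing the largest absolute increase.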

  • Sensitivity analysis.

  • Stage II weight base α sensitivity (Figure 6):
    • α = 1.01 shows stable decreasing loss.
    • α = 1.02 is noisier with upward drift.
    • α = 1.05 diverges early.
  • Block size k sensitivity (Appendix B.2; Figure 8):

    • Average acceptance length increases with block size from 4 to 32, with diminishing returns.
    • Larger target backbones (e.g., Qwen3-30B-A3B) show higher acceptance than smaller ones across the same block sizes.
  • Batch inference scalability (Section 4.4; Table 5).

  • Measured on HumanEval throughput (tokens/s), without KV cache for both methods:
    • Batch 8: AR 38.35 vs DEER 159.87 tokens/s.
    • Batch 16: AR 49.76 vs DEER 175.66 tokens/s.
  • The paper attributes stronger scaling to DEER’s parallel drafting and better GPU utilization at larger batch sizes (Section 4.4).
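As a quick arithmetic check on the two batch sizes quoted above, the throughput ratios can be computed directly. Only the four Table 5 numbers listed here are used; the variable names are illustrative.

```python
# HumanEval throughput in tokens/s without KV cache, as quoted above
# from Table 5: batch size -> (AR, DEER).
throughput = {
    8: (38.35, 159.87),
    16: (49.76, 175.66),
}

ratios = {batch: deer / ar for batch, (ar, deer) in throughput.items()}
for batch, ratio in ratios.items():
    print(f"batch {batch:2d}: DEER/AR throughput ratio = {ratio:.2f}x")
```

On these numbers the ratio is roughly 4.17× at batch 8 and 3.53× at batch 16, while absolute throughput rises for both methods as the batch grows.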

  • Math reasoning results (Section 4.6; Table 6).

  • Using Qwen3-30B-A3B at temperature 0.6, DEER improves speedup over EAGLE-3 on all three math benchmarks (Table 6):
    • Math500: 1.89× → 2.12×
    • GSM8K: 1.92× → 2.23×
    • Minerva Math: 1.91× → 2.02×
    • Mean: 1.91× → 2.12×
  • Table 6 labels the second metric as “Kendall’s τ correlation,” but the main text interprets these values as acceptance length τ (e.g., “acceptance length from 2.04 → 2.45”); this is an internal inconsistency in the provided content.

  • Do the experiments support the claims?

  • The core claim—longer accepted blocks and higher speedups vs AR-drafter baselines—is directly supported by:
    • Large deltas in τ and speedup in Tables 1–2,
    • Max acceptance length jump in Table 4,
    • Stage II ablation in Table 3.
  • The claim that diffusion drafting avoids uncertainty accumulation is supported indirectly by acceptance-length distributions (Figure 4; Figure 7) and the conceptual independence argument (Section 3.2), but the excerpt does not include a direct quantitative measure of uncertainty accumulation (e.g., KL vs depth) beyond the narrative.

6. Limitations and Trade-offs

  • KV-cache and deployment maturity for dLLMs
  • The paper explicitly notes there is “no mature framework for efficient dLLM+KV-cache deployment,” limiting integration with common serving stacks; batch experiments are run without KV cache (Section 4.4; Appendix F).
  • This means the best-case production performance might depend on future infrastructure improvements (Appendix F).

  • Training stability sensitivity in Stage II

  • Stage II’s exponential weighting has a narrow stability window: α = 1.05 diverges in the shown experiment (Figure 6).
  • This introduces hyperparameter fragility and potentially higher tuning cost.
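A rough feel for this sensitivity comes from evaluating the Stage II weights w_i = α^(R−i) at the largest suffix length, R = 96. The sketch below assumes i runs from 1 (the masked token nearest the prefix) to R (the last token); that indexing is inferred from the stated goal of emphasizing the boundary and is not confirmed by the excerpt. It only shows how the spread of weights grows with α and does not claim to be the paper's explanation for the divergence.

```python
def scribe_weights(R, alpha):
    """Weights w_i = alpha ** (R - i) for i = 1..R.

    Assumption: i = 1 is the masked position closest to the prefix, so it
    receives the largest weight and the final position receives weight 1.
    """
    return [alpha ** (R - i) for i in range(1, R + 1)]

R = 96  # upper end of R ~ Uniform(1, 96)
for alpha in (1.01, 1.02, 1.05):
    weights = scribe_weights(R, alpha)
    print(f"alpha={alpha}: largest weight {weights[0]:.1f}, smallest {weights[-1]:.1f}")
```

Under this assumption the largest weight is about 2.6 for α = 1.01, about 6.6 for α = 1.02, and over 100 for α = 1.05, so small changes in α translate into very different loss scales across positions.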

  • Draft quality depends on alignment; diffusion models are not naturally prefix continuations

  • The approach requires a specific training pipeline to mitigate distribution mismatch (Section 3.1).
  • For domains without good pretrained diffusion LMs, the paper had to construct one (math setting) and reports it is “only partially converged” (Section 4.6), which likely caps achievable acceptance lengths/speedups there.

  • Average acceptance length is still much smaller than maximum

  • While the maximum accepted length reaches 32 (Table 4), reported average τ values on code benchmarks are typically in the range of roughly 4 to 6.6 depending on model/temperature (Tables 1–2), indicating that long accepts exist but are not the dominant case.

  • Incomplete reporting of model/training details in the provided excerpt

  • Important reproducibility details (tokenizer, context length, batch size, LR schedule, total tokens) are not included here, limiting the ability to fully audit or replicate.

  • Comparability vs baselines can be constrained by feasibility

  • The paper reports OOM for Medusa and Hydra at Qwen3-14B under default configs, so large-scale comparisons are incomplete there (Appendix D; Table 8).

7. Implications and Future Directions

  • How this changes the landscape
  • It makes a concrete case that diffusion-based LMs can be practical inference accelerators for AR LLMs when used as drafters, not just as alternative generative models.
  • The acceptance-length improvements suggest that breaking the AR drafter’s left-to-right dependency can be a first-order factor in speculative decoding speedups (Section 3.2; Tables 1–2; Table 4).

  • Follow-up research enabled/suggested (grounded in the paper’s gaps)

  • Better dLLM KV-cache integration: Appendix F points to ongoing work and suggests DEER could benefit substantially once diffusion KV caching is available in mainstream serving frameworks.
  • Stronger diffusion pretraining for non-code domains: Section 4.6 shows acceleration even with a partially converged math dLLM, implying more convergence could further increase τ and speedup.
  • Understanding “long-block resurgence”: Appendix B.1 introduces this empirical phenomenon (Figure 7) and proposes an explanation; it invites deeper analysis or modeling.

  • Practical applications / downstream use cases

  • Any setting dominated by AR decoding latency where you can afford an extra small model:

    • Code generation assistants (HumanEval/MBPP/LiveCodeBench-style workloads; Section 4.2),
    • Batched inference scenarios where parallel drafting improves utilization (Section 4.4; Table 5).
  • Repro/Integration Guidance (when to prefer this)

  • Prefer DEER over AR-drafter speculative decoding when:
    • You can deploy a ~0.5B parameter drafter alongside a large AR model (Appendix A),
    • You want longer accepted runs (higher τ) and you can exploit parallelism.
  • Training cost context (Appendix D; Table 8):
    • On Qwen3-8B, DEER drafter fine-tuning is reported as ~240 GPU-hours, compared to 696 for EAGLE-3 and higher for Hydra (noting these are “approximate wall-clock” under their setup).
  • Drafter size is not the sole explanation in their analysis:
    • DEER uses a 470M-parameter drafter across targets, while EAGLE-3 draft heads range from 140M–610M depending on target (Appendix C; Table 7), yet DEER still achieves higher τ and max accept lengths (Tables 2 and 4).