Draft with Diffusion, Verify with Autoregressive Models
ArXiv: 2512.15176
Pitch
The paper introduces DEER, a lossless speculative decoding framework that drafts entire token blocks with a discrete diffusion LLM (dLLM) and then verifies them with a standard autoregressive (AR) model, replacing sequential AR drafters with one-step parallel generation. This removes left-to-right error accumulation in the drafter and enables much longer accepted draft blocks, yielding large practical inference speedups (e.g., 5.54× on HumanEval with Qwen3-30B-A3B) while preserving the target model's exact output distribution.
1. Executive Summary
DEER is a speculative decoding framework that drafts token blocks using a discrete diffusion language model (dLLM) and then verifies them with a standard autoregressive (AR) target model, aiming to reduce inference latency without changing the target model's output distribution. The key practical significance is that diffusion drafting can generate blocks in parallel and avoid the left-to-right error accumulation that limits acceptance lengths in AR drafters, yielding substantially larger speedups (e.g., 5.54× on HumanEval with Qwen3-30B-A3B at temperature 0; Table 2).
2. Context and Motivation
- Problem / gap addressed
  - Autoregressive decoding generates tokens strictly left-to-right, which is inherently sequential and therefore limits latency and throughput for LLM-based "agentic" and reasoning-style systems (Introduction).
  - Speculative decoding reduces this cost via a draft–verify workflow: a lightweight "drafter" proposes several next tokens, and the expensive target model verifies (and corrects) them.
- Why it matters
  - The bottleneck is practical: inference speed constrains interactive systems and long-context workloads (Introduction).
  - A "lossless" acceleration method is especially valuable: it speeds up generation while preserving the exact distribution of the target AR sampler (Section 4.1 "Metrics"; Appendix E).
- Prior approaches and shortfalls (as framed here)
  - Most existing speculative decoding methods rely on AR drafters (Related Work §2.1).
  - The paper identifies two structural limitations of AR drafters (Introduction; Section 3):
    - Step-wise uncertainty accumulation ("gradual collapse of trust"): errors in early draft tokens propagate because each next drafted token conditions on previously drafted (unverified) tokens.
    - Sequential drafting: AR drafters still decode left-to-right, limiting parallelism.
  - Tree/head-based AR drafting (e.g., Medusa/Hydra/EAGLE) can help but remains tied to sequential dependencies during drafting (Related Work §2.1).
- How this paper positions itself
  - It proposes using a discrete diffusion LLM as the sole drafter to (i) draft blocks in parallel and (ii) reduce uncertainty accumulation by making within-block proposals independent of earlier draft tokens (Section 3.2).
  - A major obstacle is distribution mismatch: pretrained diffusion LMs are trained for global denoising, not prefix-conditioned continuation; the paper's response is a dedicated Diffusion-to-AR alignment training pipeline (Section 3.1).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a speculative decoding stack in which a small diffusion model proposes a block of next tokens and a large AR model verifies them.
- It addresses AR decoding latency by changing how candidate tokens are proposed (parallel diffusion drafting) while keeping correctness through AR verification with rejection/resampling.
3.2 Big-picture architecture (diagram in words)
- Training (D2A Alignment):
  - Start from a pretrained dLLM (for code: an open-dCoder-derived diffusion model; Section 4.1 + Appendix A).
  - Stage I: AR-style continuation distillation to make the dLLM behave like "predict the continuation after a prefix + [SEP]".
  - Stage II: Scribe refinement to improve accuracy near the prefix boundary using weighted suffix masking.
- Inference (DEER decoding loop):
  - Given prefix x1:j, the dLLM proposes a length-k block ŷj+1:j+k in parallel.
  - The target AR model verifies token-by-token using an acceptance probability ratio (Eq. (5)).
  - If a token is rejected, it is replaced by a corrected sample via a residual construction (Eq. (6)).
  - Repeat until EOS/length limit (Algorithm 1).
3.3 Roadmap for the deep dive
- I first explain why AR drafters fail structurally (uncertainty accumulation and sequentiality).
- Then I describe the two-stage training pipeline that makes diffusion drafting usable for prefix continuations (Stage I and II).
- Next I walk through DEER inference step-by-step, including the acceptance rule and why it is lossless.
- Finally I connect the mechanism to the reported metrics: acceptance length τ and speedup.
3.4 Detailed, sentence-based technical breakdown
- Framing (type of paper + core idea).
  - This is primarily an empirical systems/algorithm paper: it designs a speculative decoding framework and training procedure so that a discrete diffusion model can act as an effective drafter for AR verification, with experiments showing higher throughput and longer accepted blocks (Sections 3–4).
- Key definitions (paper-specific or potentially unfamiliar).
  - Speculative decoding is a two-model sampling procedure that accelerates generation by drafting multiple tokens from a proposal model and then verifying them against a target model; rejection/resampling ensures the final sample follows the target distribution (Algorithm 1; Appendix E).
  - Acceptance length (τ) is the average number of drafted tokens accepted per draft–verify cycle, which correlates with how many tokens the system can effectively produce "per expensive verification" (Section 4.1; Tables 1–2).
  - A discrete diffusion language model (dLLM) generates text by predicting masked tokens via denoising in token space (using a mask token M), rather than predicting the next token causally left-to-right (Sections 3.1–3.2).
- System/data pipeline diagram in words (what happens first, second, third).
Training pipeline (Section 3.1; Figure 3; Appendix A):
1. First, prepare continuation-style training targets from an AR teacher.
   - The AR teacher is the target model distribution p_AR that the drafter should align to (Notation in Section 3.1).
   - A teacher-generated answer A of length L is randomly truncated, and a special separator token [SEP] is appended to mark the boundary between the known prefix and the to-be-generated continuation (Section 3.1.1; Figure 3).
2. Second, run Stage I (AR-style continuation distillation).
   - The dLLM observes a noised sequence x_t and is trained to reconstruct (denoise) only the masked continuation tokens, using the loss L_Distill (Eq. (1)–(2)).
   - Concretely, the training loss sums token log-probabilities log p_θ(x_i^0 | x_t) only over positions currently masked (1[x_i^t = M]), scaled by 1/(tL) in Eq. (1).
   - The intent is to change the diffusion model from "globally denoise a full sequence" into "given a prefix up to [SEP], generate a plausible continuation," reducing the distribution mismatch that otherwise makes drafts unstable (Section 3.1.1).
3. Third, run Stage II (Scribe refinement).
   - Stage II masks only the last R tokens of the answer, with R ~ Uniform(1, 96) (Section 3.1.2).
   - It applies position-dependent weights w_i = α^(R−i) (Eq. (3)) in the refined objective L_Refine (Eq. (4)), to emphasize correctness near the prefix/verification boundary (Section 3.1.2).
   - The paper reports this stage is sensitive to α and can diverge if α is set too aggressively (Figure 6).
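The two training objectives described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names, the leading minus sign (the usual convention for turning a log-likelihood into a loss), the 1-based suffix index i, and the normalization of the Stage II loss are all assumptions, since this summary only states the masked sum scaled by 1/(tL) and the weights w_i = α^(R−i).

```python
import math

def masked_denoising_loss(log_probs, is_masked, t):
    """Stage I-style objective: sum log p(x_i^0 | x_t) over masked positions,
    scaled by 1/(t*L). The minus sign is an assumed loss convention."""
    L = len(log_probs)
    masked_sum = sum(lp for lp, m in zip(log_probs, is_masked) if m)
    return -masked_sum / (t * L)

def suffix_weights(R, alpha):
    """Stage II-style weights w_i = alpha**(R - i), assuming i = 1..R indexes
    the masked suffix starting at the prefix boundary."""
    return [alpha ** (R - i) for i in range(1, R + 1)]

def weighted_suffix_loss(suffix_log_probs, alpha):
    """Weighted negative log-likelihood over the masked suffix.
    Dividing by the weight sum is an assumption made for readability."""
    weights = suffix_weights(len(suffix_log_probs), alpha)
    weighted = sum(w * lp for w, lp in zip(weights, suffix_log_probs))
    return -weighted / sum(weights)

# Illustrative values only (not taken from the paper).
stage1 = masked_denoising_loss([math.log(0.5)] * 4, [True, False, True, False], t=0.5)
boundary_weight_101 = suffix_weights(96, 1.01)[0]
boundary_weight_105 = suffix_weights(96, 1.05)[0]
```

Under these assumptions, with α > 1 the weight is largest at i = 1, the token next to the prefix. At R = 96 the boundary weight is about 2.6 for α = 1.01 and about 103 for α = 1.05, which gives a feel for how quickly the weighting sharpens as α grows.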
Inference pipeline (Section 3.2; Algorithm 1):
1. Draft a block in parallel.
   - Given current prefix x1:j, the diffusion drafter samples a block of k tokens:
     - ŷj+1:j+k ~ q_θ(· | x1:j) (Section 3.2).
   - A crucial modeling property claimed here is within-block independence from prior drafted tokens:
     - q_θ(ŷ_i | x1:j, ŷ1:i−1) = q_θ(ŷ_i | x1:j) (Section 3.2), unlike AR drafting.
2. Verify token-by-token against the AR target.
   - For each token position i = 1..k inside the block, compute the acceptance probability:
     - α_i = min(1, p_AR(ŷj+i | x1:j+i−1) / q_θ(ŷj+i | x1:j)) (Eq. (5)).
   - Accept with probability α_i; otherwise reject and replace.
3. If rejected, resample from the residual distribution.
   - The paper gives a residual-form update:
     - ŷj+i is drawn with probability proportional to max(0, p_AR(· | x1:j+i−1) − q_θ(· | x1:j)) (Eq. (6)).
   - Algorithm 1 states that when rejected, the token is sampled "via Eq. 6" and appended to the prefix.
4. Append accepted/resampled tokens to the prefix, continue until EOS.
   - The context for later positions includes whichever token was chosen at each position (accepted draft or AR replacement), matching standard speculative decoding flow (Algorithm 1).
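The verification steps above can be sketched as a small accept/reject loop. This is a hedged illustration rather than Algorithm 1 itself: `verify_block`, `residual`, and `target_dist` are hypothetical names, distributions are plain lists indexed by token id, and the target is queried once per position for clarity (a real system would score the whole block in one parallel forward pass). The sketch stops the block at the first rejection, as in standard speculative sampling; whether Algorithm 1 continues past a replacement is not specified in this summary.

```python
import random

def residual(p, q):
    """Normalized max(0, p - q); falls back to p if the residual has no mass."""
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(r)
    return [ri / z for ri in r] if z > 0 else list(p)

def verify_block(prefix, draft, q_dists, target_dist, rng):
    """Verify a drafted block left to right.

    draft[i] is the drafted token id at block position i, q_dists[i] is the
    drafter distribution it was sampled from (conditioned on the prefix only),
    and target_dist(context) returns the target distribution over token ids.
    Returns the list of tokens to append to the prefix.
    """
    out = []
    for tok, q in zip(draft, q_dists):
        p = target_dist(prefix + out)  # target conditions on chosen tokens
        accept_prob = min(1.0, p[tok] / q[tok]) if q[tok] > 0 else 0.0
        if rng.random() < accept_prob:
            out.append(tok)  # draft token accepted
        else:
            r = residual(p, q)
            out.append(rng.choices(range(len(r)), weights=r)[0])
            break  # assumption: discard the rest of the block after a rejection
    return out

# Toy check: a drafter that matches the target exactly is always accepted.
rng = random.Random(0)
accepted = verify_block([7], [0, 1, 0], [[0.5, 0.5]] * 3, lambda ctx: [0.5, 0.5], rng)
```

The denominator uses the drafter probability conditioned on the original prefix only, while the numerator uses the target probability conditioned on everything chosen so far, mirroring the asymmetry in Eq. (5).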
- Worked micro-example (single input → output walk-through).
  - Suppose the current prefix is x1:j = "The capital", and block size k = 4.
  - The dLLM proposes in parallel: ŷj+1:j+4 = [" of", " France", " is", " Paris"].
  - Verification proceeds left-to-right inside the block:
    - For token " of", compute α_1 = min(1, p_AR(" of" | "The capital") / q_θ(" of" | "The capital")) (Eq. (5)).
      - If accepted, append " of" to the prefix.
      - If rejected, sample a replacement token from the residual (Eq. (6)) and append that instead.
    - For token " France", compute α_2 using the updated prefix x1:j+1 (which now includes whatever was chosen at step 1), but the denominator remains q_θ(· | x1:j) as written in Eq. (5).
  - This continues until either all 4 tokens are processed or EOS appears (Algorithm 1).
  - The claimed efficiency gain is that the expensive model needs fewer sequential forward passes per generated token when many tokens are accepted, and the diffusion drafter can propose the block with parallel computation.
- Why diffusion drafting reduces "uncertainty accumulation" in this setup (mechanistic explanation).
  - With an AR drafter, the proposal for later tokens depends on earlier drafted tokens:
    - q_AR(ŷ_i | x1:j, ŷ1:i−1) ≠ q_AR(ŷ_i | x1:j) (Section 3.2).
  - If early draft tokens differ from the AR model's preferred tokens, this mismatch compounds because the AR drafter continues drafting conditioned on its own (potentially wrong) outputs, shrinking acceptance rates deeper in the block (Introduction; Section 3.2).
  - The diffusion drafter is trained to propose tokens based on the prefix with masked conditioning, giving the paper's key independence relation:
    - q_θ(ŷ_i | x1:j, ŷ1:i−1) = q_θ(ŷ_i | x1:j) (Section 3.2).
  - As presented, this removes the drafter's internal left-to-right feedback loop, so later positions are less affected by earlier drafter mistakes.
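To see why shrinking acceptance rates deeper in the block matter, it can help to compute the expected number of accepted draft tokens from per-position acceptance probabilities. The sketch below is illustrative only: it assumes a block is cut at the first rejection, that `accept_probs[i]` is the chance position i is accepted given that all earlier positions were accepted, and the two probability profiles are made-up numbers, not measurements from the paper.

```python
def expected_accepted(accept_probs):
    """Expected count of draft tokens accepted before the first rejection.

    Position i only counts if every earlier position was accepted, so the
    expectation is the sum over i of the running product of probabilities.
    """
    total, survive = 0.0, 1.0
    for a in accept_probs:
        survive *= a      # probability that positions 0..i are all accepted
        total += survive
    return total

# Hypothetical profiles for a block of 8 tokens.
flat = [0.8] * 8                               # acceptance does not degrade with depth
decaying = [0.8 * 0.9 ** i for i in range(8)]  # acceptance degrades with depth

flat_len = expected_accepted(flat)
decaying_len = expected_accepted(decaying)
```

Under these assumptions the decaying profile always yields a shorter expected accepted run than the flat one, which is the qualitative effect the paper attributes to AR drafters conditioning on their own unverified tokens.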
- Correctness / "losslessness".
  - The paper includes a formal proof that the algorithm is lossless, meaning the final output distribution matches direct sampling from the target AR model p_AR (Appendix E, Theorem E.2).
  - The proof is structured via a one-step lemma (Lemma E.1) for a single token position using an accept/reject rule with a residual distribution, then extended to the full sequence by induction (Appendix E.1–E.3).
  - A stated requirement is a support condition: if p_AR(x) > 0 then the proposal q_θ(x) > 0 (Appendix E.1 Eq. (7); Appendix E.2 notes the training "ensures" this, though the mechanism for guaranteeing strict support is not elaborated in the provided excerpt).
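The one-step argument can be checked numerically on a toy vocabulary. The sketch below is not the paper's proof; it uses exact rational arithmetic to confirm that "accept with probability min(1, p/q), otherwise resample from the normalized residual" returns the target distribution for one made-up pair of distributions. The function name and the example distributions are assumptions.

```python
from fractions import Fraction as F

def one_step_output(p, q):
    """Exact output distribution of a single accept/reject step.

    A token x is emitted either by being drafted and accepted, with mass
    q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)), or by being drawn from the
    normalized residual max(0, p - q) after a rejection.
    """
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1 - sum(accepted)
    resid = [max(F(0), pi - qi) for pi, qi in zip(p, q)]
    z = sum(resid)
    if z == 0:  # p == q everywhere, nothing is ever rejected
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, resid)]

# Toy target p and drafter q over a 3-token vocabulary (illustrative values).
p = [F(1, 2), F(1, 3), F(1, 6)]
q = [F(1, 4), F(1, 4), F(1, 2)]
output = one_step_output(p, q)
```

Here `output` equals `p` exactly, because the rejected mass 1 − Σ min(p, q) is the same quantity as the total residual mass Σ max(0, p − q).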
- Core configurations / hyperparameters explicitly given (and what is missing).
  - Hardware: experiments run on 8× NVIDIA A100 GPUs (80 GB each) (Appendix A).
  - Code drafter model: a 0.5B-parameter discrete diffusion drafter derived by modifying open-dCoder/open-dLLM (Appendix A; Section 4.1 "Models").
  - Math drafter model: a 0.5B-parameter drafter initialized from Qwen2.5-0.5B-Instruct and converted to a diffusion model with a diffusion decoding head (Section 4.1; Appendix A).
  - Optimization (code tasks):
    - Stage I: AdamW, learning rate 1e-4, 1 epoch over the code training corpus (Appendix A).
    - Stage II: AdamW, learning rate 5e-5, 1 epoch on a 100k-example subset (Appendix A).
  - Optimization (math tasks):
    - Pretraining/conversion training: 40 epochs on UltraChat to obtain a partially converged dLLM (Section 4.6; Appendix A).
    - Stage I: AdamW, learning rate 1e-4, 5 epochs (Appendix A).
    - Stage II: AdamW, learning rate 1e-4, 1 epoch (Appendix A).
  - Stage II masking hyperparameters: R ~ Uniform(1, 96) and weight base α (Section 3.1.2; Figure 6).
  - Block size k: used in inference (Algorithm 1); sensitivity to block size is explored for k ∈ {4, 8, 16, 32} (Figure 8).
  - Missing (not provided in the excerpt):
    - Target model decoding settings beyond temperature (e.g., top-p/top-k).
    - Tokenizer details, context window length, and architecture specifics (layers/hidden size/heads) for the dLLM and targets.
    - Total training tokens, exact batch sizes, weight decay, and learning rate schedule.
4. Key Insights and Innovations
- (1) Diffusion-as-drafter for speculative decoding with one-step block drafting
  - Novelty: DEER uses a discrete-space dLLM as the sole drafter (Contribution bullets; Section 3), rather than an AR drafter or hybrid.
  - Significance: This is positioned as addressing both AR-drafter bottlenecks simultaneously:
    - Parallel block generation (efficiency),
    - Reduced uncertainty accumulation via within-block proposal independence (acceptance length).
- (2) Two-stage Diffusion-to-AR (D2A) alignment to fix distribution mismatch
  - Stage I (Section 3.1.1): "AR-style continuation distillation" transforms the dLLM's training into prefix-conditioned continuation by truncating teacher answers and inserting [SEP].
  - Stage II (Section 3.1.2): "Scribe refinement" focuses accuracy near the verification boundary with weighted suffix masking and exponential weights.
  - Significance: Experiments show Stage II increases accepted-token length on harder benchmarks (Table 3), which directly affects speedup.
- (3) Empirical acceptance-length gains to long blocks (up to 32 tokens)
  - DEER reaches maximum accepted lengths of 32 tokens across multiple target model scales (Table 4).
  - Compared baseline: EAGLE-3's max accepted length is 7–8 tokens in the same table.
- (4) Reported emergent behavior: "reliable block regeneration"
  - The paper claims the aligned dLLM can repeatedly regenerate partially masked suffix blocks coherently ("reliable block regeneration"; contribution bullets; Section 4.5; Figure 5).
  - This is presented as a qualitative capability enabled by the training pipeline rather than architectural changes (Section 4.5).
5. Experimental Analysis
- Evaluation methodology (Section 4.1).
  - Code generation:
    - Training data: OpenCodeInstruct (Section 4.1).
    - Evaluation benchmarks: HumanEval, MBPP, LiveCodeBench, CodeAlpacaPy (Python subset) (Section 4.1).
  - Math reasoning:
    - Training data: UltraChat and ShareGPT (Section 4.1).
    - Evaluation benchmarks: GSM8K, Math500, Minerva Math (Section 4.1).
  - Baselines: Medusa, Hydra, EAGLE-3 (Section 4.1).
  - Metrics:
    - Speedup ratio: end-to-end speedup vs standard AR decoding.
    - Average acceptance length (τ): accepted tokens per speculative cycle (Section 4.1).
  - Temperatures reported:
    - Temperature 0.6 (Table 1; and math Table 6).
    - Temperature 0 (Table 2; Figure 1).
  - KV cache conditions:
    - Many code benchmark results are "with KV cache" (Tables 1–2; Section 4.2.1).
    - The batch inference study is explicitly run without KV cache for both AR and DEER due to the lack of mature dLLM KV-cache deployment (Section 4.4; Table 5; Appendix F).
- Main quantitative results (code benchmarks).

Temperature = 0, with KV cache (Table 2):
- Qwen3-30B-A3B on HumanEval:
  - DEER: 5.54× speedup, τ = 6.58
  - EAGLE-3: 2.41× speedup, τ = 3.21
  - This is the headline result also echoed in the abstract/intro.
- Mean across the four code benchmarks (MBPP, CodeAlpacaPy, HumanEval, LiveCodeBench):
  - For Qwen3-30B-A3B: DEER mean 4.04×, τ = 5.03 vs EAGLE-3 mean 2.21×, τ = 3.05 (Table 2).
- For smaller targets, DEER similarly improves mean speedup and τ:
  - Qwen3-14B: mean 2.98×, τ = 4.82 vs EAGLE-3 mean 2.39×, τ = 3.54 (Table 2).
  - Qwen3-8B: mean 2.83×, τ = 4.61 vs EAGLE-3 mean 2.48×, τ = 3.45 (Table 2).
  - Qwen3-4B: mean 2.77×, τ = 4.61 vs EAGLE-3 mean 2.40×, τ = 3.31 (Table 2).

Temperature = 0.6, with KV cache (Table 1):
- For Qwen3-30B-A3B, mean DEER speedup 3.62×, τ = 4.45 vs EAGLE-3 mean 2.01×, τ = 2.40 (Table 1).
- This shows DEER's advantage persists under nonzero temperature, though absolute values differ.
- Acceptance distribution and maximum accepted lengths.
  - Max accepted length is 32 tokens for DEER across Qwen3 targets; EAGLE-3 is 7–8 tokens (Table 4).
  - The fraction of "long accepts" (≥ 8 tokens) is reported around 15.8%–16.5% across Qwen3-4B/8B/14B in Figure 4.
  - Appendix B.1 / Figure 7 reports a "long-block resurgence effect": counts roughly decay exponentially up to about 30 tokens, then increase near the maximum range (the explanation given is reduced exposure to left-to-right error propagation when later draft tokens are not conditioned on earlier draft tokens).
- Ablation: impact of Stage II refinement.
  - On Qwen3-30B-A3B, Stage II increases average accepted length across all four code benchmarks (Table 3):
    - MBPP: 4.74 → 4.87
    - CodeAlpacaPy: 3.47 → 4.04
    - HumanEval: 5.38 → 6.58
    - LiveCodeBench: 3.87 → 5.03
  - The largest gains are on the harder benchmarks (HumanEval, LiveCodeBench), consistent with the claim that boundary accuracy is crucial for acceptance (Section 4.3.1).
- Sensitivity analysis.
  - Stage II weight base α sensitivity (Figure 6):
    - α = 1.01 shows stable decreasing loss.
    - α = 1.02 is noisier with upward drift.
    - α = 1.05 diverges early.
  - Block size k sensitivity (Appendix B.2; Figure 8):
    - Average acceptance length increases with block size from 4 to 32, with diminishing returns.
    - Larger target backbones (e.g., Qwen3-30B-A3B) show higher acceptance than smaller ones across the same block sizes.
- Batch inference scalability (Section 4.4; Table 5).
  - Measured on HumanEval throughput (tokens/s), without KV cache for both methods:
    - Batch 8: AR 38.35 vs DEER 159.87 tokens/s.
    - Batch 16: AR 49.76 vs DEER 175.66 tokens/s.
  - The paper attributes stronger scaling to DEER's parallel drafting and better GPU utilization at larger batch sizes (Section 4.4).
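For a quick sense of scale, the relative throughput implied by the quoted Table 5 figures can be computed directly. This is simple arithmetic on the numbers above; the dictionary layout and the function name are illustrative choices.

```python
# Throughput in tokens/s as quoted above from Table 5 (HumanEval, no KV cache).
throughput = {
    8: {"AR": 38.35, "DEER": 159.87},
    16: {"AR": 49.76, "DEER": 175.66},
}

def relative_throughput(batch_size):
    """DEER throughput divided by plain AR throughput at one batch size."""
    row = throughput[batch_size]
    return row["DEER"] / row["AR"]

ratios = {b: round(relative_throughput(b), 2) for b in throughput}
```

By this measure DEER delivers about 4.17× the AR throughput at batch 8 and about 3.53× at batch 16.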
- Math reasoning results (Section 4.6; Table 6).
  - Using Qwen3-30B-A3B at temperature 0.6, DEER improves speedup over EAGLE-3 on all three math benchmarks (Table 6):
    - Math500: 1.89× → 2.12×
    - GSM8K: 1.92× → 2.23×
    - Minerva Math: 1.91× → 2.02×
    - Mean: 1.91× → 2.12×
  - Table 6 labels the second metric as "Kendall's τ correlation," but the main text interprets these values as acceptance length τ (e.g., "acceptance length from 2.04 → 2.45"); this is an internal inconsistency in the provided content.
- Do the experiments support the claims?
  - The core claim (longer accepted blocks and higher speedups vs AR-drafter baselines) is directly supported by:
    - Large deltas in τ and speedup in Tables 1–2,
    - Max acceptance length jump in Table 4,
    - Stage II ablation in Table 3.
  - The claim that diffusion drafting avoids uncertainty accumulation is supported indirectly by acceptance-length distributions (Figure 4; Figure 7) and the conceptual independence argument (Section 3.2), but the excerpt does not include a direct quantitative measure of uncertainty accumulation (e.g., KL vs depth) beyond the narrative.
6. Limitations and Trade-offs
- KV-cache and deployment maturity for dLLMs
  - The paper explicitly notes there is "no mature framework for efficient dLLM+KV-cache deployment," limiting integration with common serving stacks; batch experiments are run without KV cache (Section 4.4; Appendix F).
  - This means the best-case production performance might depend on future infrastructure improvements (Appendix F).
- Training stability sensitivity in Stage II
  - Stage II's exponential weighting has a narrow stability window: α = 1.05 diverges in the shown experiment (Figure 6).
  - This introduces hyperparameter fragility and potentially higher tuning cost.
- Draft quality depends on alignment; diffusion models are not naturally suited to prefix continuation
  - The approach requires a specific training pipeline to mitigate distribution mismatch (Section 3.1).
  - For domains without good pretrained diffusion LMs, the paper had to construct one (math setting) and reports it is "only partially converged" (Section 4.6), which likely caps achievable acceptance lengths/speedups there.
- Average acceptance length is still much smaller than maximum
  - While maximum accepted length reaches 32 (Table 4), reported average τ values on code benchmarks are typically around 4–6.6 depending on model/temperature (Tables 1–2), indicating long accepts exist but are not the dominant case.
- Incomplete reporting of model/training details in the provided excerpt
  - Important reproducibility details (tokenizer, context length, batch size, LR schedule, total tokens) are not included here, limiting the ability to fully audit or replicate.
- Comparability vs baselines can be constrained by feasibility
  - The paper reports OOM for Medusa and Hydra at Qwen3-14B under default configs, so large-scale comparisons are incomplete there (Appendix D; Table 8).
7. Implications and Future Directions
- How this changes the landscape
  - It makes a concrete case that diffusion-based LMs can be practical inference accelerators for AR LLMs when used as drafters, not just as alternative generative models.
  - The acceptance-length improvements suggest that breaking the AR drafter's left-to-right dependency can be a first-order factor in speculative decoding speedups (Section 3.2; Tables 1–2; Table 4).
- Follow-up research enabled/suggested (grounded in the paper's gaps)
  - Better dLLM KV-cache integration: Appendix F points to ongoing work and suggests DEER could benefit substantially once diffusion KV caching is available in mainstream serving frameworks.
  - Stronger diffusion pretraining for non-code domains: Section 4.6 shows acceleration even with a partially converged math dLLM, implying more convergence could further increase τ and speedup.
  - Understanding "long-block resurgence": Appendix B.1 introduces this empirical phenomenon (Figure 7) and proposes an explanation; it invites deeper analysis or modeling.
- Practical applications / downstream use cases
  - Any setting dominated by AR decoding latency where you can afford an extra small model:
    - Code generation assistants (HumanEval/MBPP/LiveCodeBench-style workloads; Section 4.2),
    - Batched inference scenarios where parallel drafting improves utilization (Section 4.4; Table 5).
- Repro/Integration Guidance (when to prefer this)
  - Prefer DEER over AR-drafter speculative decoding when:
    - You can deploy a ~0.5B parameter drafter alongside a large AR model (Appendix A),
    - You want longer accepted runs (higher τ) and you can exploit parallelism.
  - Training cost context (Appendix D; Table 8):
    - On Qwen3-8B, DEER drafter fine-tuning is reported as ~240 GPU-hours, compared to 696 for EAGLE-3 and higher for Hydra (noting these are "approximate wall-clock" under their setup).
  - Drafter size is not the sole explanation in their analysis:
    - DEER uses a 470M-parameter drafter across targets, while EAGLE-3 draft heads range from 140M–610M depending on target (Appendix C; Table 7), yet DEER still achieves higher τ and max accept lengths (Tables 2 and 4).