Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, these attributes of dLLMs can actually be a strength when dLLMs serve as drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that the speed dLLMs gain from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to realize the (elusive) lengthy drafts that yield large speedups in speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency, and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs, achieving up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
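To make the adaptive speculation-length idea concrete, below is a minimal, hypothetical sketch of a speculative decoding loop with a dLLM drafter and an AR verifier, in which the draft length grows aggressively after full acceptance and shrinks sharply after a rejection. The callables `draft_fn` and `verify_fn` and the specific update rule (doubling on full acceptance, halving on rejection, capped at 70 tokens) are illustrative assumptions for exposition, not FailFast's actual policy.

```python
# Hypothetical sketch of adaptive-length speculative decoding in the spirit of
# the abstract: a dLLM drafter proposes k tokens in parallel, an AR verifier
# checks them, and k shrinks sharply after a rejection ("fail fast") while
# growing aggressively after full acceptance ("win big"). The callables and
# the concrete update rule are illustrative assumptions, not FailFast's policy.

def adaptive_speculative_decode(draft_fn, verify_fn, prompt,
                                max_new_tokens=256, k_init=8, k_min=2, k_max=70):
    """draft_fn(tokens, k) -> list of k proposed tokens (one parallel dLLM draft pass).
    verify_fn(tokens, proposal) -> (n_accepted, correction): how many proposed
    tokens the AR verifier accepts, plus the verifier's own next token, so each
    round makes progress and the AR model's output distribution is preserved."""
    tokens = list(prompt)
    k = k_init
    while len(tokens) - len(prompt) < max_new_tokens:
        proposal = draft_fn(tokens, k)                 # cheap parallel draft of k tokens
        n_accepted, correction = verify_fn(tokens, proposal)
        tokens += list(proposal[:n_accepted]) + [correction]
        if n_accepted == len(proposal):
            k = min(2 * k, k_max)                      # easy region: extend the draft
        else:
            k = max(k_min, k // 2)                     # hard region: fail fast, shrink
    return tokens
```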