Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and mixture-of-experts (MoE) models. Existing sparsification methods, designed primarily for standard token-by-token autoregressive decoding, remove substantial computational redundancy in LLMs but do not target the parallel verification workload. This work systematically applies different sparsification methods to the verification stage of speculative decoding and identifies structured redundancy across multiple dimensions. Based on these observations, we propose a sparse verification framework that jointly sparsifies the attention, FFN, and MoE components during verification to reduce the dominant computational cost. The framework additionally incorporates an inter-draft-token and inter-layer retrieval reuse strategy that removes further redundant computation without requiring additional training. Extensive experiments on summarization, question answering, and mathematical reasoning datasets demonstrate that the proposed methods achieve favorable efficiency-accuracy trade-offs while maintaining stable acceptance length.
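For readers unfamiliar with the verification stage being sparsified, the following is a minimal sketch of greedy speculative-decoding verification, assuming a HuggingFace-style causal LM; the function name `verify_draft_tokens` and its interface are illustrative, not the paper's implementation. In the proposed framework, the dense target-model forward pass inside this step is where the attention/FFN/MoE sparsification would apply.

```python
import torch

def verify_draft_tokens(target_model, input_ids, draft_tokens):
    """One greedy verification step of speculative decoding (sketch).

    Scores all draft tokens in a single forward pass of the target
    model and accepts the longest prefix whose greedy predictions
    match the drafts. Shapes: input_ids (1, n_ctx), draft_tokens
    (1, n_draft).
    """
    # Concatenate context and draft tokens so the target model scores
    # every draft position in one parallel forward pass. This pass is
    # the verification bottleneck that sparsification targets.
    seq = torch.cat([input_ids, draft_tokens], dim=-1)
    logits = target_model(seq).logits  # (1, seq_len, vocab)

    n_ctx, n_draft = input_ids.shape[-1], draft_tokens.shape[-1]
    # Logits at position i predict token i+1, so positions
    # n_ctx-1 .. n_ctx+n_draft-2 score the n_draft draft tokens.
    pred = logits[0, n_ctx - 1 : n_ctx + n_draft - 1].argmax(dim=-1)

    # Accept the longest prefix of drafts matching the target model.
    matches = (pred == draft_tokens[0]).long()
    n_accept = int(matches.cumprod(dim=0).sum())

    accepted = draft_tokens[0, :n_accept]
    # On the first mismatch, emit the target model's own token instead,
    # guaranteeing at least one token of progress per verification step
    # (the slice is empty if every draft token was accepted).
    correction = pred[n_accept : n_accept + 1]
    return torch.cat([accepted, correction])
```

Because all draft positions are scored in one forward pass, the acceptance decision is unchanged by how the forward pass is computed internally, which is what allows verification-time sparsification to preserve acceptance length.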