Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored to VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving the original positional information of the image tokens. In addition, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs with the target VLM under modified prompts. Our training strategy also mitigates the risk that the draft model exploits direct access to the target model's hidden states, a shortcut it could otherwise learn when trained solely on target model outputs. Extensive experiments validate ViSpec: to our knowledge, it achieves the first substantial speedup in speculative decoding for VLMs. Code is available at https://github.com/KangJialiang/ViSpec.
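To make the adaptor idea concrete, the following is a minimal PyTorch sketch of one plausible realization of the two mechanisms named above: compressing image tokens into a small fixed set via cross-attention with learnable queries, and augmenting text tokens with a pooled global image feature. The class and function names, the choice of 16 compressed tokens, the learnable-query cross-attention, mean pooling for the global feature, and additive augmentation are all illustrative assumptions, not the paper's actual implementation (see the linked repository for that); handling of positional information inside the draft model's attention is likewise omitted here.

```python
import torch
import torch.nn as nn


class VisionAdaptor(nn.Module):
    """Hypothetical sketch: compress a long sequence of image tokens into a
    small fixed set via cross-attention with learnable query vectors, and
    extract a global image feature by mean pooling."""

    def __init__(self, hidden_dim: int, num_compressed: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable queries; each one attends over all image tokens.
        self.queries = nn.Parameter(torch.randn(num_compressed, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, image_tokens: torch.Tensor):
        # image_tokens: (batch, n_img, hidden_dim), e.g. n_img = 576 for a ViT patch grid
        batch = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(q, image_tokens, image_tokens)
        compressed = self.proj(compressed)       # (batch, num_compressed, hidden_dim)
        # Global feature (assumption: mean pooling), later added to text tokens.
        global_feat = image_tokens.mean(dim=1)   # (batch, hidden_dim)
        return compressed, global_feat


def augment_text_tokens(text_tokens: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
    """Broadcast-add the global image feature to every subsequent text token."""
    # text_tokens: (batch, n_txt, hidden_dim)
    return text_tokens + global_feat.unsqueeze(1)


# Example: compress 576 image tokens to 16 and augment 32 text tokens.
adaptor = VisionAdaptor(hidden_dim=1024)
img = torch.randn(2, 576, 1024)
txt = torch.randn(2, 32, 1024)
compressed, g = adaptor(img)
txt_aug = augment_text_tokens(txt, g)
print(compressed.shape, txt_aug.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 32, 1024])
```

In this sketch, the 16 compressed tokens would be prepended to the draft model's input in place of the full image sequence, which is what would reduce the draft model's per-step cost while retaining a summary of the visual content.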