Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored to VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving the original positional information of the image tokens. In addition, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs with the target VLM under modified prompts. Our training strategy also mitigates the risk that the draft model exploits direct access to the target model's hidden states, a shortcut it could otherwise learn when trained solely on target model outputs. Extensive experiments validate ViSpec: to our knowledge, it achieves the first substantial speedup in speculative decoding for VLMs. Code is available at https://github.com/KangJialiang/ViSpec.
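To make the adaptor idea concrete, the following is a minimal PyTorch sketch of one plausible realization of the two mechanisms named above: compressing image tokens into a small fixed set via cross-attention with learnable queries, and augmenting text tokens with a pooled global image feature. The class and function names, the choice of 16 compressed tokens, the learnable-query cross-attention, mean pooling for the global feature, and additive augmentation are all illustrative assumptions, not the paper's actual implementation (see the linked repository for that); handling of positional information inside the draft model's attention is likewise omitted here.

```python
import torch
import torch.nn as nn


class VisionAdaptor(nn.Module):
    """Hypothetical sketch: compress a long sequence of image tokens into a
    small fixed set via cross-attention with learnable query vectors, and
    extract a global image feature by mean pooling."""

    def __init__(self, hidden_dim: int, num_compressed: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable queries; each one attends over all image tokens.
        self.queries = nn.Parameter(torch.randn(num_compressed, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, image_tokens: torch.Tensor):
        # image_tokens: (batch, n_img, hidden_dim), e.g. n_img = 576 for a ViT patch grid
        batch = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(q, image_tokens, image_tokens)
        compressed = self.proj(compressed)       # (batch, num_compressed, hidden_dim)
        # Global feature (assumption: mean pooling), later added to text tokens.
        global_feat = image_tokens.mean(dim=1)   # (batch, hidden_dim)
        return compressed, global_feat


def augment_text_tokens(text_tokens: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
    """Broadcast-add the global image feature to every subsequent text token."""
    # text_tokens: (batch, n_txt, hidden_dim)
    return text_tokens + global_feat.unsqueeze(1)


# Example: compress 576 image tokens to 16 and augment 32 text tokens.
adaptor = VisionAdaptor(hidden_dim=1024)
img = torch.randn(2, 576, 1024)
txt = torch.randn(2, 32, 1024)
compressed, g = adaptor(img)
txt_aug = augment_text_tokens(txt, g)
print(compressed.shape, txt_aug.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 32, 1024])
```

In this sketch, the 16 compressed tokens would be prepended to the draft model's input in place of the full image sequence, which is what would reduce the draft model's per-step cost while retaining a summary of the visual content.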