Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands: existing architectures process all tokens uniformly, regardless of how informative they are. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning each visual token a hybrid importance score that combines ViT CLS attention (intra-modal saliency) with CLIP text-image similarity (inter-modal relevance), and forwarding only the top-K tokens to the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones such as BLIP-2, LLaVA, and Flamingo. Preliminary evaluations on VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.
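To make the gating step concrete, below is a minimal PyTorch sketch of the hybrid scoring and top-K selection described above. The input names (`cls_attention`, `clip_text_emb`, `clip_vision_proj`), the weighted-sum combination with weight `alpha`, the softmax normalization of the similarity term, and the fixed `keep_ratio` are illustrative assumptions rather than the exact formulation.

```python
import torch

def atp_prune(vision_tokens, cls_attention, clip_text_emb, clip_vision_proj,
              keep_ratio=0.6, alpha=0.5):
    """Sketch of ATP-style hybrid-score token pruning (assumed interface).

    vision_tokens:    (N, d)   patch tokens from the vision encoder
    cls_attention:    (N,)     last-layer CLS->patch attention (intra-modal saliency)
    clip_text_emb:    (d_c,)   CLIP text embedding of the prompt
    clip_vision_proj: (N, d_c) patch tokens projected into CLIP's joint space
    keep_ratio:       fraction of tokens forwarded to the LLM (assumed fixed here)
    alpha:            assumed weight balancing the two score components
    """
    # Intra-modal saliency: normalize CLS attention over patches so it sums to 1.
    intra = cls_attention / cls_attention.sum()

    # Inter-modal relevance: cosine similarity between each patch and the text query,
    # normalized with softmax so the two terms live on a comparable scale.
    text = clip_text_emb / clip_text_emb.norm()
    patches = clip_vision_proj / clip_vision_proj.norm(dim=-1, keepdim=True)
    inter = torch.softmax(patches @ text, dim=0)

    # Hybrid importance score and per-input top-K selection.
    scores = alpha * intra + (1 - alpha) * inter
    k = max(1, int(keep_ratio * vision_tokens.shape[0]))
    keep_idx = scores.topk(k).indices.sort().values  # preserve original spatial order

    return vision_tokens[keep_idx], keep_idx
```

In this sketch the backbone is untouched: pruning happens purely on the token set handed to the LLM, which is what allows the module to be dropped in front of different backbones.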