Despite achieving remarkable performance on various vision-language tasks, Transformer-based pretrained vision-language models (VLMs) still suffer from efficiency issues arising from long inputs and numerous parameters, limiting their real-world applications. However, the huge computation is redundant for most samples and the degree of redundancy and the respective components vary significantly depending on tasks and input instances. In this work, we propose an adaptive acceleration method SmartTrim for VLMs, which adjusts the inference overhead based on the complexity of instances. Specifically, SmartTrim incorporates lightweight trimming modules into the backbone to perform task-specific pruning on redundant inputs and parameters, without the need for additional pre-training or data augmentation. Since visual and textual representations complement each other in VLMs, we propose to leverage cross-modal interaction information to provide more critical semantic guidance for identifying redundant parts. Meanwhile, we introduce a self-distillation strategy that encourages the trimmed model to be consistent with the full-capacity model, which yields further performance gains. Experimental results demonstrate that SmartTrim significantly reduces the computation overhead (2-3 times) of various VLMs with comparable performance (only a 1-2% degradation) on various vision-language tasks. Compared to previous acceleration methods, SmartTrim attains a better efficiency-performance trade-off, demonstrating great potential for application in resource-constrained scenarios.
翻译:暂无翻译