With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of image inputs has become increasingly important. Existing VLMs predominantly digest and understand images compressed at high bitrates, while their ability to interpret low-bitrate compressed images remains largely unexplored. In this paper, we introduce the first comprehensive benchmark for evaluating VLMs on compressed images, spanning widely used image codecs and a diverse set of tasks, and comprising over one million compressed images. We then analyse the sources of the performance gap, categorising it into a) information loss during compression and b) generalisation failure of the VLM. We visualise these gaps with concrete examples and show that, for compressed images, only the generalisation gap can be mitigated. Finally, we propose a universal VLM adaptor that enhances model performance on images compressed by existing codecs. We demonstrate that a single adaptor improves VLM performance by 10%-30% across images with varying codecs and bitrates. We believe our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.