Contrastive Language-Image Pre-training (CLIP) is a powerful multimodal large vision model that has demonstrated significant benefits for downstream tasks, including many zero-shot learning and text-guided vision tasks. However, we observe severe problems regarding the model's explainability, which undermine its credibility and impede related tasks. Specifically, we find that CLIP prefers background regions over the foreground according to the predicted similarity map, which contradicts human understanding. Moreover, the visualization results show obvious noisy activations at irrelevant positions. To address these two issues, we conduct in-depth analyses and reveal the underlying reasons with new findings and evidence. Based on these insights, we propose CLIP Surgery, a method that applies surgery-like modifications to the inference architecture and features, yielding better explainability and enhancement on multiple open-vocabulary tasks. The proposed method significantly improves the explainability of CLIP for both convolutional networks and vision transformers, surpassing existing methods by large margins. Furthermore, our approach demonstrates remarkable improvements in open-vocabulary segmentation and multi-label recognition. For example, the mAP on NUS-Wide multi-label recognition improves by 4.41% without any additional training, and CLIP Surgery surpasses the state-of-the-art method by 8.74% mIoU on Cityscapes open-vocabulary semantic segmentation. In addition, our method benefits other tasks, including multimodal visualization and interactive segmentation with the Segment Anything Model (SAM). The code is available at https://github.com/xmed-lab/CLIP_Surgery
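To make the notion of a text-to-image similarity map concrete, below is a minimal sketch (not the authors' CLIP Surgery implementation) of how per-patch similarities between a text embedding and patch-level image features can be computed and reshaped into a spatial map. The tensors `patch_feats` and `text_feat`, the function `similarity_map`, and the 7×7 grid size are illustrative assumptions standing in for features extracted from a CLIP image and text encoder.

```python
# Minimal sketch: cosine similarity between each image patch feature and a
# text embedding, reshaped into a spatial map for visualization.
# The feature tensors here are random placeholders, not real CLIP outputs.
import torch
import torch.nn.functional as F

def similarity_map(patch_feats: torch.Tensor,   # [N, D] patch-level image features
                   text_feat: torch.Tensor,     # [D]    text embedding for one prompt
                   grid_hw: tuple) -> torch.Tensor:  # (H, W) patch grid, H * W == N
    """Return an [H, W] map of cosine similarities between each patch and the text."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = patch_feats @ text_feat               # [N] cosine similarity per patch
    return sim.reshape(grid_hw)                 # [H, W]; upsample to image size for display

# Usage with placeholder features (hypothetical 7x7 patch grid, 512-dim embeddings).
patches = torch.randn(7 * 7, 512)
text = torch.randn(512)
heatmap = similarity_map(patches, text, (7, 7))
print(heatmap.shape)  # torch.Size([7, 7])
```

In a real pipeline, `patch_feats` would come from the intermediate tokens of the CLIP image encoder and `text_feat` from encoding a class prompt; the resulting heatmap is the kind of similarity map whose background preference and noisy activations the paper analyzes.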