Contrastive Language-Image Pre-training (CLIP) learns rich representations from readily available natural-language supervision. It improves general performance on downstream vision tasks, including but not limited to zero-shot classification, long-tailed recognition, segmentation, retrieval, captioning, and video understanding. However, to the best of our knowledge, the visual interpretability of CLIP has not yet been studied. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers background regions over foregrounds and produces erroneous visualizations that contradict human understanding. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon we call semantic shift. To correct and improve the visualization results, we propose Masked Max Pooling, which uses an attention map from a self-supervised image encoder. Meanwhile, the interpretability and recognition tasks require different representations; to address this, we propose dual projections to cater to both requirements. We integrate the above methods as Interpretable Contrastive Language-Image Pre-training (ICLIP), and experiments suggest that ICLIP greatly improves interpretability. For example, the nontrivial improvements are $32.85\%$ and $49.10\%$, respectively, on the VOC 2012 dataset.
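To make the notion of an Image-Text Similarity Map concrete, the sketch below shows one plausible way such a heatmap could be computed: cosine similarity between per-patch image token features and a text embedding, reshaped to the patch grid and upsampled for visualization. This is a minimal illustration assuming a CLIP-style vision transformer whose spatial token features are accessible; the function and argument names are ours, not the paper's API.

```python
import torch
import torch.nn.functional as F

def image_text_similarity_map(patch_feats, text_feat, grid_hw, out_hw):
    """Illustrative ITSM-style heatmap (hypothetical helper, not the paper's code).

    patch_feats: (N, D) spatial token features from the image encoder (CLS token excluded).
    text_feat:   (D,)   text embedding of the query caption or class name.
    grid_hw:     (h, w) patch grid size, with h * w == N.
    out_hw:      (H, W) output resolution for visualization.
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = patch_feats @ text_feat                      # (N,) cosine similarity per patch
    sim = sim.reshape(1, 1, *grid_hw)                  # arrange as a 2-D map
    sim = F.interpolate(sim, size=out_hw,
                        mode="bilinear", align_corners=False)
    return sim.squeeze()                               # (H, W) similarity heatmap
```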