Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for image-text matching because of its holistic use of natural language supervision that covers large-scale, open-world visual concepts. However, it remains challenging to adapt CLIP to compositional image and text matching -- a harder matching task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause matching failures. Therefore, we propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embeddings and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically assess the contribution of each entity when performing image and text matching. Experiments on compositional image-text matching on SVO and ComVG and on general image-text retrieval on Flickr8K demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP without any further training or fine-tuning of CLIP.
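To make the evolving-matching idea concrete, the sketch below shows one plausible way to compose frozen CLIP embeddings of a full image, its sub-images, and the parsed entity phrases into a single matching score. This is only an illustrative sketch, not the authors' implementation: the function `comclip_score`, the softmax weighting of entity similarities, and the assumption that sub-images and entity phrases are already extracted and aligned are all hypothetical choices for exposition.

```python
# Minimal sketch (assumption, not the paper's code) of matching a caption against an
# image composed from a global view plus subject/object/predicate sub-images,
# using a frozen pretrained CLIP from Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def comclip_score(image: Image.Image, sub_images: list, caption: str, entities: list) -> float:
    """Score an (image, caption) pair; sub_images[i] is assumed to correspond to entities[i]."""
    texts = [caption] + entities            # full sentence + parsed entity phrases
    images = [image] + sub_images           # full image + subject/object/predicate crops
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)

    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    # Per-entity similarities gate how much each sub-image contributes,
    # so poorly matching entities are down-weighted dynamically.
    entity_sims = (img_emb[1:] * txt_emb[1:]).sum(-1)
    weights = torch.softmax(entity_sims, dim=0)
    composed_img = img_emb[0] + (weights[:, None] * img_emb[1:]).sum(0)
    composed_img = composed_img / composed_img.norm()

    # Final score: composed image embedding vs. the full-sentence embedding.
    return float(composed_img @ txt_emb[0])
```

In this hypothetical formulation, no CLIP parameters are updated; the composition happens purely at inference time over the frozen encoders, which is what makes the approach plug-and-play.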