Contrastive Language Image Pretraining (CLIP) has received widespread attention since its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aims to align positive image-text pairs and separate negative ones. In this paper, we show a representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. We introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between the image and text spaces, which efficiently transfers higher-level structural knowledge. We further propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale up to unlimited amounts of data. Combining the above novel designs, we train ProtoCLIP on Conceptual Captions and achieve a +5.81% improvement in ImageNet linear probing and a +2.01% improvement in ImageNet zero-shot classification. Code is available at https://github.com/megvii-research/protoclip.
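To make the training objective concrete, below is a minimal sketch of the symmetric InfoNCE loss used in CLIP-style pretraining. It is an illustration under stated assumptions, not the authors' implementation; the tensor names, the temperature value, and the PyTorch framing are assumptions.

import torch
import torch.nn.functional as F

def info_nce_loss(image_features, text_features, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings: matching
    # image-text pairs (the diagonal of the similarity matrix) are pulled
    # together, while all other pairs in the batch act as negatives.
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # (batch, batch) similarity matrix scaled by temperature.
    logits = image_features @ text_features.t() / temperature
    # The i-th image matches the i-th text: targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average of image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2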
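The prototype-level discrimination mentioned above can be pictured with a short, hedged sketch: captions are assigned to a set of prototypes, and image embeddings are trained to predict their paired caption's assignment. The prototype matrix, its size K, the nearest-centroid assignment, and the temperature are assumptions for illustration; the actual clustering method and update schedule follow the paper, not this sketch.

import torch
import torch.nn.functional as F

def prototype_discrimination_loss(image_features, text_features,
                                  prototypes, temperature=0.07):
    # `prototypes` is a (K, d) matrix of centroids, e.g. obtained by
    # clustering text embeddings (an assumption for this sketch).
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    # Hard-assign each caption to its nearest text-space prototype.
    targets = (text_features @ prototypes.t()).argmax(dim=-1)
    # Train images to predict their paired caption's prototype assignment,
    # i.e. discrimination at the prototype level rather than the instance level.
    logits = image_features @ prototypes.t() / temperature
    return F.cross_entropy(logits, targets)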