Contrastive Language Image Pretraining (CLIP) has received widespread attention since its learned representations can be transferred well to various downstream tasks. During CLIP training, the InfoNCE objective aims to align positive image-text pairs and separate negative ones. In this paper, we show a representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. We introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. We further propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data. We trained ProtoCLIP on Conceptual Captions and achieved a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC dataset, ProtoCLIP matches the performance of CLIP with 4$\times$ fewer pretraining epochs. Code is available at https://github.com/megvii-research/protoclip.
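For reference, the image-to-text half of the symmetric InfoNCE objective referred to above is commonly written as follows; the notation ($f_I$, $f_T$, $\tau$, $N$) is standard contrastive-learning convention rather than taken from this abstract:
$$\mathcal{L}_{I \to T} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\langle f_I(x_i),\, f_T(y_i)\rangle / \tau\right)}{\sum_{j=1}^{N}\exp\left(\langle f_I(x_i),\, f_T(y_j)\rangle / \tau\right)},$$
where $f_I$ and $f_T$ are the $\ell_2$-normalized image and text encoders, $\tau$ is a temperature, and $N$ is the batch size. Each positive pair $(x_i, y_i)$ is pulled together while the $N-1$ in-batch negatives are pushed apart; the full loss symmetrizes over a matching $\mathcal{L}_{T \to I}$ term.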