Contrastive language-image pre-training (CLIP) serves as a de-facto standard for aligning images and texts. Nonetheless, the loose correlation between images and texts in web-crawled data renders the contrastive objective data-inefficient and reliant on a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP), and study whether the nice properties exhibited by visual self-supervised models can also emerge. We empirically observe that the non-contrastive objective nourishes representation learning while significantly underperforming in zero-shot recognition. Based on the above study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between the two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. Systematic evaluation is conducted spanning a wide variety of downstream tasks, including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, showcasing a consistent performance gain and validating the effectiveness of xCLIP.
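To make the multi-tasking idea concrete, below is a minimal PyTorch sketch of how a combined objective might look: a standard symmetric InfoNCE term (CLIP) plus a non-contrastive cross-prediction term between the two modalities (nCLIP), balanced by a weight. The exact form of the non-contrastive head, its regularization, and the loss weighting are assumptions for illustration and are not specified in this abstract; function and argument names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Standard symmetric InfoNCE over in-batch image-text pairs (CLIP objective).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def nclip_cross_prediction_loss(img_probs, txt_probs, eps=1e-8):
    # Illustrative non-contrastive term: each modality predicts the other's
    # distribution over a projection head's outputs via cross-entropy,
    # without using other samples in the batch as negatives.
    loss_i2t = -(txt_probs * (img_probs + eps).log()).sum(dim=-1).mean()
    loss_t2i = -(img_probs * (txt_probs + eps).log()).sum(dim=-1).mean()
    return (loss_i2t + loss_t2i) / 2

def xclip_loss(img_emb, txt_emb, img_probs, txt_probs, weight=1.0):
    # Multi-task combination of the two objectives; `weight` is an assumed
    # hyperparameter balancing the contrastive and non-contrastive terms.
    return (clip_contrastive_loss(img_emb, txt_emb) +
            weight * nclip_cross_prediction_loss(img_probs, txt_probs))
```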