Contrastive learning is a powerful self-supervised learning method, but we have a limited theoretical understanding of how and why it works. In this paper, we prove that contrastive learning with the standard InfoNCE loss is equivalent to spectral clustering on the similarity graph. Using this equivalence as a building block, we extend our analysis to the CLIP model and rigorously characterize how similar multi-modal objects are embedded together. Motivated by our theoretical insights, we introduce the kernel mixture loss, incorporating novel kernel functions that outperform the standard Gaussian kernel on several vision datasets.
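For concreteness, the standard InfoNCE loss referenced above can be sketched as follows. This is a minimal NumPy illustration of the common formulation (cosine similarity with a temperature, cross-entropy against the positive pair on the diagonal), not the paper's implementation; the function name and parameters are our own.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.5):
    """InfoNCE loss between two batches of paired embeddings.

    z1, z2: (n, d) arrays; row i of z1 and row i of z2 form a
    positive pair, and all other rows act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (n, n) similarity matrix
    # Log-softmax over each row; the diagonal entry is the positive.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Aligned pairs should incur a lower loss than unrelated ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce_loss(z, z)
shuffled = info_nce_loss(z, rng.normal(size=(8, 16)))
```

The loss drives each embedding toward its positive pair and away from the in-batch negatives; the paper's result is that minimizing this objective recovers spectral clustering on the induced similarity graph.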