Language-supervised vision models have recently attracted great attention in computer vision. A common approach to building such models is contrastive learning on paired data across the two modalities, as exemplified by Contrastive Language-Image Pre-Training (CLIP). In this paper, under linear representation settings, (i) we initiate the investigation of a general class of nonlinear loss functions for multimodal contrastive learning (MMCL), including the CLIP loss, and show its connection to singular value decomposition (SVD). Namely, we show that each step of loss minimization by gradient descent can be seen as performing SVD on a contrastive cross-covariance matrix. Based on this insight, (ii) we analyze the performance of MMCL. We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality, even in the presence of wrongly matched pairs. This characterizes the robustness of MMCL to noisy data. Furthermore, when we have access to additional unpaired data, (iii) we propose a new MMCL loss that incorporates additional unpaired datasets. We show that the algorithm can detect the ground-truth pairs and improve performance by fully exploiting unpaired datasets. The performance of the proposed algorithm is verified by numerical experiments.
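The SVD connection in (i) can be illustrated with a small numerical sketch. The abstract does not spell out the contrastive cross-covariance matrix, so the code below assumes one simple form: the average of x_i y_i^T over matched pairs minus the average of x_i y_j^T over mismatched pairs (i ≠ j). The function names, the synthetic data model (two views sharing a low-dimensional latent), and all parameter values are illustrative assumptions, not the paper's construction.

```python
import numpy as np


def contrastive_cross_covariance(X, Y):
    """Assumed form: mean of x_i y_i^T over pairs minus mean of x_i y_j^T over i != j.

    X has shape (n, d1) and Y has shape (n, d2); row i of X is paired with row i of Y.
    """
    n = X.shape[0]
    paired_sum = X.T @ Y                              # sum_i x_i y_i^T
    all_sum = np.outer(X.sum(axis=0), Y.sum(axis=0))  # sum_{i,j} x_i y_j^T
    mismatched_mean = (all_sum - paired_sum) / (n * (n - 1))
    return paired_sum / n - mismatched_mean


def top_r_svd_encoders(S, r):
    """Linear encoders given by the top-r left and right singular vectors of S."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    G1 = U[:, :r].T   # shape (r, d1), encoder for the first modality
    G2 = Vt[:r, :]    # shape (r, d2), encoder for the second modality
    return G1, G2, s[:r]


def make_data(n=2000, d1=20, d2=15, r=3, noise=0.1, seed=0):
    """Hypothetical data model: both views are noisy linear maps of a shared latent Z."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((d1, r)))  # orthonormal columns
    B, _ = np.linalg.qr(rng.standard_normal((d2, r)))
    Z = rng.standard_normal((n, r))
    X = Z @ A.T + noise * rng.standard_normal((n, d1))
    Y = Z @ B.T + noise * rng.standard_normal((n, d2))
    return X, Y, A, B


if __name__ == "__main__":
    X, Y, A, B = make_data()
    S = contrastive_cross_covariance(X, Y)
    G1, G2, s = top_r_svd_encoders(S, r=3)
    # Distance between the learned and the true signal subspaces (0 means identical).
    print("modality 1 subspace error:", np.linalg.norm(G1.T @ G1 - A @ A.T, 2))
    print("modality 2 subspace error:", np.linalg.norm(G2.T @ G2 - B @ B.T, 2))
```

Under this assumed data model, the top singular vectors of the matrix span approximately the same subspaces as the shared latent directions in each modality, which is the sense in which an SVD step performs feature learning here.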