On the Importance of Contrastive Loss in Multimodal Learning

Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved great success in multimodal learning. In these approaches, the model tries to minimize the distance between the representations of different views of the same data point (e.g., an image and its caption) while keeping the representations of different data points far apart. From a theoretical perspective, however, it is unclear how contrastive learning can learn representations from different views efficiently, especially when the data is not isotropic. In this work, we analyze the training dynamics of a simple multimodal contrastive learning model and show that contrastive pairs are important for the model to efficiently balance the learned representations. In particular, we show that the positive pairs drive the model to align the representations at the cost of increasing the condition number, while the negative pairs reduce the condition number, keeping the learned representations balanced.
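To make the roles of positive and negative pairs concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss using only the Python standard library. It is an illustration under stated assumptions, not the simplified model analyzed in the paper: the function names, the temperature value, and the toy embeddings are all hypothetical. The diagonal similarity terms are the positive pairs (image i with caption i); the off-diagonal terms inside the log-sum-exp are the negative pairs.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of image_embs and row i of text_embs are assumed to be two
    views of the same data point (a positive pair); every other
    combination in the batch is treated as a negative pair.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and caption j, divided by temperature
    sim = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row is a softmax over captions.
    loss_i2t = sum(logsumexp(sim[i]) - sim[i][i] for i in range(n)) / n
    # Text-to-image direction: each column is a softmax over images.
    loss_t2i = sum(
        logsumexp([sim[i][j] for i in range(n)]) - sim[j][j] for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch (hypothetical data): two orthogonal "image" embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
captions_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each caption matches its image
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]   # each caption matches the other image

loss_aligned = symmetric_contrastive_loss(images, captions_aligned)
loss_swapped = symmetric_contrastive_loss(images, captions_swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each image is closest to its own caption and large when the pairing is swapped, which is the alignment pressure exerted by positive pairs. The negative pairs enter only through the log-sum-exp terms, which penalize similarity between mismatched items.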



Multimodal learning


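Real-world information usually appears in several modalities. For example, images are often accompanied by tags and text explanations, and articles include images to convey their main idea more clearly. Different modalities have very different statistical properties: images are usually represented as pixel intensities or as the outputs of feature extractors, while text is represented as discrete word vectors. Because these statistical properties differ, discovering the relationships between modalities is important. Multimodal learning models provide a joint representation of different modalities, and they can also fill in a missing modality given the observed ones. In one such model, two deep Boltzmann machines are combined, one for each modality, and an additional hidden layer is placed on top of the two Boltzmann machines to give the joint representation.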
