This paper proposes a method for representation learning of multimodal data using contrastive losses. A traditional approach is to contrast different modalities to learn the information shared between them. However, that approach could fail to learn the complementary synergies between modalities that might be useful for downstream tasks. Another approach is to concatenate all the modalities into a tuple and then contrast positive and negative tuple correspondences. However, that approach could consider only the stronger modalities while ignoring the weaker ones. To address these issues, we propose a novel contrastive learning objective, TupleInfoNCE. It contrasts tuples based not only on positive and negative correspondences but also by composing new negative tuples using modalities describing different scenes. Training with these additional negatives encourages the learning model to examine the correspondences among modalities in the same tuple, ensuring that weak modalities are not ignored. We provide a theoretical justification based on mutual information for why this approach works, and we propose a sample optimization algorithm to generate positive and negative samples to maximize training efficacy. We find that TupleInfoNCE significantly outperforms the previous state of the art on three different downstream tasks.
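The idea of composing extra negative tuples can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the concatenation-based `encode_tuple`, the function name `tuple_info_nce`, the `compose_negatives` flag, and the roll-based swapping scheme are all assumptions made for illustration, and the sample optimization algorithm mentioned in the abstract is not modeled. Each anchor tuple is contrasted against its positive view, against the tuples of other scenes, and against composed tuples in which one modality has been replaced by the same modality from a different scene.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def encode_tuple(modalities):
    """Stand-in tuple encoder (assumption): concatenate features, then normalize."""
    return l2_normalize(np.concatenate(modalities, axis=-1))


def tuple_info_nce(anchor_mods, positive_mods, temperature=0.1,
                   compose_negatives=True, rng=None):
    """InfoNCE over tuples, optionally with composed negative tuples.

    anchor_mods / positive_mods: lists of (batch, dim_k) arrays, one per
    modality; row i of every array describes scene i (two views of it).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    batch = anchor_mods[0].shape[0]  # must be at least 2

    anchors = encode_tuple(anchor_mods)
    # Row i of the first block is the positive for anchor i;
    # the other rows of that block are the usual tuple-level negatives.
    candidates = [encode_tuple(positive_mods)]

    if compose_negatives:
        for k in range(len(positive_mods)):
            # A nonzero roll guarantees modality k comes from another scene.
            shift = int(rng.integers(1, batch))
            swapped = list(positive_mods)
            swapped[k] = np.roll(positive_mods[k], shift, axis=0)
            candidates.append(encode_tuple(swapped))

    logits = anchors @ np.concatenate(candidates, axis=0).T / temperature
    # Numerically stable log-softmax over all candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(batch)
    return float(-log_prob[idx, idx].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Two hypothetical modalities for 8 scenes, plus a lightly perturbed view.
    view_a = [rng.normal(size=(8, 16)), rng.normal(size=(8, 4))]
    view_b = [m + 0.05 * rng.normal(size=m.shape) for m in view_a]
    print("with composed negatives:", tuple_info_nce(view_a, view_b))
    print("plain tuple InfoNCE:    ",
          tuple_info_nce(view_a, view_b, compose_negatives=False))
```

In this sketch a composed tuple agrees with its source scene in every modality except the swapped one, so an encoder can only reject it by attending to that modality, which is the intuition the abstract gives for why weak modalities are not ignored.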