Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs against sets of negative samples. Recently, the principle has also been used to learn cross-modal embeddings for video and text, yet without exploiting its full potential. In particular, previous losses do not take the intra-modality similarities into account, which leads to inefficient embeddings, as the same content is mapped to multiple points in the embedding space. With CrossCLR, we present a contrastive loss that fixes this issue. Moreover, we define sets of highly related samples in terms of their input embeddings and exclude them from the negative samples to avoid issues with false negatives. We show that these principles consistently improve the quality of the learned embeddings. The joint embeddings learned with CrossCLR extend the state of the art in video-text retrieval on the YouCook2 and LSMDC datasets and in video captioning on the YouCook2 dataset by a large margin. We also demonstrate the generality of the concept by learning improved joint embeddings for other pairs of modalities.
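To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch of a CrossCLR-style objective: a cross-modal InfoNCE loss whose denominator also contains intra-modality similarities, with highly related samples (identified from the input-level features) pruned from the negative set. The function name `crossclr_style_loss`, the quantile-based pruning rule, and the exact way the intra-modal logits enter the loss are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def crossclr_style_loss(video_emb, text_emb, video_feats,
                        temperature=0.07, prune_quantile=0.9):
    """Sketch of a CrossCLR-style loss (illustrative, not the paper's exact form).

    video_emb, text_emb: (B, D) projected embeddings to be aligned.
    video_feats: (B, F) input-level features, used only to find highly
    related samples that are excluded from the negatives.
    """
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)

    # Cross-modal and intra-modal similarity logits.
    logits_vt = v @ t.T / temperature   # video-to-text
    logits_vv = v @ v.T / temperature   # video-to-video (intra-modality)
    logits_tt = t @ t.T / temperature   # text-to-text (intra-modality)

    B = v.size(0)
    diag = torch.eye(B, dtype=torch.bool, device=v.device)

    # Identify highly related samples from *input* features and mask them
    # out of the negative set (assumed rule: input similarity above a batch
    # quantile counts as a likely false negative).
    with torch.no_grad():
        f = F.normalize(video_feats, dim=1)
        sim = f @ f.T
        thresh = torch.quantile(sim[~diag], prune_quantile)
        related = (sim > thresh) & ~diag  # never mask the positive pair itself

    neg_inf = torch.finfo(logits_vt.dtype).min
    logits_vt = logits_vt.masked_fill(related, neg_inf)
    logits_vv = logits_vv.masked_fill(related | diag, neg_inf)
    logits_tt = logits_tt.masked_fill(related | diag, neg_inf)

    # For each anchor the positive is the matching index; the intra-modal
    # similarities enter the denominator as additional negatives.
    labels = torch.arange(B, device=v.device)
    loss_v = F.cross_entropy(torch.cat([logits_vt, logits_vv], dim=1), labels)
    loss_t = F.cross_entropy(torch.cat([logits_vt.T, logits_tt], dim=1), labels)
    return 0.5 * (loss_v + loss_t)
```

Note the design split the abstract implies: the loss is computed on the learned joint embeddings, while the pruning of false negatives is driven by the fixed input embeddings, so the definition of "highly related" does not drift as training progresses.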