The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically related -- for example, visually similar videos or ones that depict the same action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of the visual representations of other support samples. This simple idea ensures that representations are not overly specialized to individual samples and are reusable across the dataset, and it yields representations that explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval.
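To make the reconstruction idea concrete, below is a minimal PyTorch-style sketch of the cross-instance attention step the abstract describes: a caption-side query attends over the other samples' visual embeddings and is rebuilt as their weighted combination. The function name, shapes, and temperature are illustrative assumptions, not the paper's exact implementation; the full method would additionally feed the pooled representation into a caption decoder trained alongside the contrastive objective.

```python
# Minimal sketch of reconstructing a caption representation from a support set
# of other samples' visual embeddings (hypothetical names and shapes).
import torch
import torch.nn.functional as F

def support_set_pooling(text_queries, video_embeddings, exclude_self=True, temperature=0.07):
    """Express each sample's caption as a weighted combination of the *other*
    samples' visual representations in the batch (the support set).

    text_queries:     (B, D) text embeddings acting as attention queries
    video_embeddings: (B, D) visual embeddings acting as keys and values
    Returns:          (B, D) support-set-pooled visual representations
    """
    scores = text_queries @ video_embeddings.t() / temperature       # (B, B) similarities
    if exclude_self:
        # Mask the diagonal so a caption cannot be explained by its own video.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                              # attention over the support set
    return weights @ video_embeddings                                # weighted combination

# Toy usage: the pooled output would condition a generative decoder whose
# caption-reconstruction loss complements the usual contrastive objective.
B, D = 8, 256
text, video = torch.randn(B, D), torch.randn(B, D)
pooled = support_set_pooling(text, video)
print(pooled.shape)  # torch.Size([8, 256])
```

Because the weights are shared across the batch, a sample whose caption is well explained by other videos' representations is naturally pulled toward them rather than pushed away, which is the behaviour the abstract contrasts with plain noise contrastive learning.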