Cross-modal contrastive learning has driven recent advances in multimodal retrieval owing to its simplicity and effectiveness. In this work, however, we reveal that cross-modal contrastive learning incorrectly normalizes the total retrieval probability of each text or video instance. Specifically, we show that many test instances are either over- or under-represented during retrieval, significantly hurting retrieval performance. To address this problem, we propose Normalized Contrastive Learning (NCL), which uses the Sinkhorn-Knopp algorithm to compute instance-wise biases that properly normalize the total retrieval probability of each instance, so that every text and video instance is fairly represented during cross-modal retrieval. Empirical studies show that NCL brings consistent and significant gains in text-video retrieval across different model architectures, setting new state-of-the-art multimodal retrieval metrics on the ActivityNet, MSVD, and MSR-VTT datasets without any architecture engineering.
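To make the normalization concrete, below is a minimal NumPy sketch of how Sinkhorn-Knopp iterations can yield such instance-wise biases: alternately rescaling the rows and columns of the exponentiated similarity matrix until it is approximately doubly stochastic, then taking the logs of the scaling vectors as per-text and per-video biases. The function name, temperature value, iteration count, and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sinkhorn_biases(sim, tau=0.05, n_iters=100):
    """Hypothetical sketch: return per-text and per-video log-scaling
    factors (instance-wise biases) that make the rescaled matrix
    exp(sim / tau) approximately doubly stochastic, i.e. each
    instance's total retrieval probability sums to ~1."""
    K = np.exp(sim / tau)          # unnormalized text-video affinities
    u = np.ones(K.shape[0])        # row (text) scaling factors
    v = np.ones(K.shape[1])        # column (video) scaling factors
    for _ in range(n_iters):
        u = 1.0 / (K @ v)          # rescale rows so each sums to 1
        v = 1.0 / (K.T @ u)        # rescale columns so each sums to 1
    return np.log(u), np.log(v)

# Toy usage: 4 text and 4 video embeddings on the unit sphere.
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8)); t /= np.linalg.norm(t, axis=1, keepdims=True)
w = rng.normal(size=(4, 8)); w /= np.linalg.norm(w, axis=1, keepdims=True)
sim = t @ w.T                      # cosine similarities
b_text, b_video = sinkhorn_biases(sim)
P = np.exp(sim / 0.05 + b_text[:, None] + b_video[None, :])
print(P.sum(axis=1))               # each text's total probability ~ 1
print(P.sum(axis=0))               # each video's total probability ~ 1
```

Because the Sinkhorn scaling factors enter multiplicatively, their logs act as additive biases on the similarity logits, which is what makes every text and video instance's total retrieval probability uniform rather than over- or under-weighted.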