Video-text retrieval is a crucial and fundamental task in multi-modal research. Its development has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, so the coarse-grained feature can guide the similarity calculation to filter out unnecessary fine-grained features, thus improving retrieval accuracy. To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. A further challenge lies in the similarity aggregation problem, which is to aggregate the fine-grained and cross-grained similarity matrices into an instance-level similarity. To address this challenge, we propose the Attention Over Similarity Matrix (AOSM) module, which makes the model focus on the contrast between essential frames and words, thus lowering the impact of unnecessary frames and words on retrieval results. With multi-grained contrast and the proposed AOSM module, X-CLIP achieves outstanding performance on five widely used video-text retrieval datasets: MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1 R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7% and +3.8% relative improvements on these benchmarks, respectively, demonstrating the superiority of multi-grained contrast and AOSM.
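To make the aggregation idea concrete, the following is a minimal NumPy sketch of multi-grained similarity with attention-style pooling over similarity scores. The abstract does not state the exact formulation, so several details here are assumptions for illustration only: the softmax-weighted sum used for pooling, the temperature `tau`, the equal-weight average of the four similarity terms, and all function and argument names (`attention_pool`, `multi_grained_similarity`, `video`, `frames`, `sentence`, `words`). Inputs are assumed to be L2-normalized feature vectors.

```python
import numpy as np


def attention_pool(scores, axis, tau=0.01):
    """Softmax-weighted sum of `scores` along `axis` (assumed pooling rule).

    A small `tau` approaches max pooling; a large `tau` approaches mean pooling.
    """
    z = scores / tau
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(z)
    weights = weights / weights.sum(axis=axis, keepdims=True)
    return (weights * scores).sum(axis=axis)


def multi_grained_similarity(video, frames, sentence, words, tau=0.01):
    """Combine coarse-, cross- and fine-grained similarities into one score.

    video:    (d,)   coarse-grained video feature
    frames:   (n, d) fine-grained frame features
    sentence: (d,)   coarse-grained text feature
    words:    (m, d) fine-grained word features
    """
    # Coarse-grained: video vs. sentence (a single scalar).
    s_video_sentence = float(video @ sentence)

    # Cross-grained: one coarse feature against each fine feature,
    # then pooled so that weakly related fine features get little weight.
    s_video_word = attention_pool(words @ video, axis=0, tau=tau)
    s_frame_sentence = attention_pool(frames @ sentence, axis=0, tau=tau)

    # Fine-grained: full frame-word similarity matrix of shape (n, m),
    # pooled along each axis in turn and averaged over both orders.
    sim = frames @ words.T
    per_frame = attention_pool(sim, axis=1, tau=tau)  # (n,)
    per_word = attention_pool(sim, axis=0, tau=tau)   # (m,)
    s_frame_word = 0.5 * (
        attention_pool(per_frame, axis=0, tau=tau)
        + attention_pool(per_word, axis=0, tau=tau)
    )

    # Equal-weight average of the four terms (an assumption of this sketch).
    return float(
        (s_video_sentence + s_video_word + s_frame_sentence + s_frame_word) / 4.0
    )
```

In this sketch, pooling with a small `tau` lets a few highly similar frames or words dominate the instance-level score, which is one way to reduce the influence of unnecessary frames and words. A learned or differently parameterized attention could be substituted without changing the overall structure.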