Text-video retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering the intrinsic semantic relations among frames, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between the video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
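To make the idea of hierarchical contrastive learning concrete, the following is a minimal sketch, not the implementation used by HCMI. It assumes that each granularity (frame/word, clip/phrase, video/sentence) has already been pooled to one vector per sample, and it sums a symmetric InfoNCE loss over the three levels. The function names, the `temperature` value, and the per-level `weights` are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity of video i and text j; the diagonal
    entries are the matched pairs. `temperature` is an assumed value.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    cols = [list(c) for c in zip(*sim)]  # transpose: text-to-video direction
    return 0.5 * (one_direction(sim) + one_direction(cols))


def hierarchical_loss(video_levels, text_levels, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-level contrastive losses.

    video_levels / text_levels: three lists, ordered frame/word,
    clip/phrase, video/sentence, each holding one pooled vector per
    sample in the batch (an assumption of this sketch).
    """
    loss = 0.0
    for w, vids, txts in zip(weights, video_levels, text_levels):
        sim = [[cosine(v, t) for t in txts] for v in vids]
        loss += w * info_nce(sim)
    return loss


# Toy batch of two samples with 2-D features at every level.
level = [[1.0, 0.0], [0.0, 1.0]]
aligned = hierarchical_loss([level] * 3, [level] * 3)
shuffled = hierarchical_loss([level] * 3, [level[::-1]] * 3)
```

In this toy batch, `aligned` is close to zero because every video matches its own text at all three levels, while `shuffled` is much larger because the matched pairs sit off the diagonal. A finer-grained variant would compare individual frames with individual words rather than pooled vectors.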