Videos contain multi-modal content, and exploring multi-level cross-modal interactions with natural language queries can greatly benefit the text-video retrieval (TVR) task. However, recent methods that apply the large-scale pre-trained model CLIP to TVR do not exploit the multi-modal cues in videos. Furthermore, traditional methods that simply concatenate multi-modal features fail to capture fine-grained cross-modal information in videos. In this paper, we propose a multi-level multi-modal hybrid fusion (M2HF) network to explore comprehensive interactions between text queries and the content of each modality in videos. Specifically, M2HF first fuses the visual features extracted by CLIP with the audio and motion features extracted from videos at an early stage, obtaining audio-visual and motion-visual fusion features, respectively. The multi-modal alignment problem is also addressed in this process. Then, the visual features, audio-visual fusion features, motion-visual fusion features, and texts extracted from videos establish cross-modal relationships with caption queries in a multi-level manner. Finally, the retrieval outputs from all levels are fused at a late stage to obtain the final text-video retrieval results. Our framework supports two training strategies: an ensemble manner and an end-to-end manner. Moreover, a novel multi-modal balance loss is proposed to balance the contribution of each modality for efficient end-to-end training. M2HF achieves state-of-the-art results on various benchmarks, e.g., Rank@1 of 64.9\%, 68.2\%, 33.2\%, 57.1\%, and 57.8\% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
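To make the pipeline concrete, the following is a minimal PyTorch sketch of the multi-level hybrid fusion and late score fusion described above. The module names, the additive early-fusion operator, the averaging late fusion, and the InfoNCE-style balance loss are illustrative assumptions for exposition only, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class M2HFSketch(nn.Module):
    """Illustrative sketch of multi-level multi-modal hybrid fusion.

    Assumptions: all modalities are pre-extracted as D-dim embeddings,
    early fusion is projection + addition, and late fusion averages the
    per-level similarity matrices.
    """

    def __init__(self, dim=512):
        super().__init__()
        # Early fusion: project audio / motion features into the CLIP
        # visual space before combining them with the visual features.
        self.audio_proj = nn.Linear(dim, dim)
        self.motion_proj = nn.Linear(dim, dim)
        # Learnable temperature for similarity logits, as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def _early_fuse(self, visual, other, proj):
        # Assumed early-fusion operator: projection + addition.
        return F.normalize(visual + proj(other), dim=-1)

    def forward(self, text, visual, audio, motion, video_text):
        # text:       (B, D) caption query embeddings
        # visual:     (B, D) CLIP visual embeddings (pooled over frames)
        # audio:      (B, D) audio embeddings
        # motion:     (B, D) motion embeddings
        # video_text: (B, D) embeddings of texts extracted from videos
        text = F.normalize(text, dim=-1)
        levels = {
            "visual": F.normalize(visual, dim=-1),
            "audio_visual": self._early_fuse(visual, audio, self.audio_proj),
            "motion_visual": self._early_fuse(visual, motion, self.motion_proj),
            "video_text": F.normalize(video_text, dim=-1),
        }
        # One text-video similarity matrix per level (multi-level matching).
        sims = {k: self.logit_scale * text @ v.t() for k, v in levels.items()}
        # Late fusion: average per-level similarities for the final ranking.
        final = torch.stack(list(sims.values())).mean(dim=0)
        return final, sims


def balance_loss(sims, weights=None):
    """Hypothetical multi-modal balance loss: a weighted sum of symmetric
    InfoNCE losses over the per-level similarity matrices."""
    losses = []
    for name, s in sims.items():
        labels = torch.arange(s.size(0), device=s.device)
        nce = 0.5 * (F.cross_entropy(s, labels) + F.cross_entropy(s.t(), labels))
        w = 1.0 if weights is None else weights.get(name, 1.0)
        losses.append(w * nce)
    return sum(losses)
```

In this sketch the per-level weights in `balance_loss` stand in for the mechanism that balances each modality's contribution during end-to-end training; the ensemble variant would instead train each level separately and only apply the late fusion at retrieval time.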