Text-video retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., between short clips and phrases or between single frames and words. In this paper, we propose a novel method, named HunYuan_tvr, that models hierarchical cross-modal interactions by simultaneously exploring video-sentence, clip-phrase, and frame-word relationships. Considering the intrinsic semantic relations between frames, HunYuan_tvr first performs self-attention to capture frame-wise correlations and adaptively clusters correlated frames into clip-level representations. Clip-wise correlations are then used to aggregate the clip representations into a single compact representation that describes the video globally. In this way, we construct hierarchical video representations at frame-clip-video granularities, and likewise use word-wise correlations to form word-phrase-sentence embeddings for the text modality. Finally, hierarchical contrastive learning is designed to capture cross-modal relationships at every level, i.e., frame-word, clip-phrase, and video-sentence, which enables HunYuan_tvr to achieve a comprehensive multi-modal understanding. Further boosted by adaptive label denoising and marginal sample enhancement, HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
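To make the three-level objective concrete, the following is a minimal sketch of a hierarchical contrastive loss, not the HunYuan_tvr implementation. Several simplifications are assumptions made here for illustration: fixed-size mean pooling stands in for the self-attention and adaptive clustering described above, the set-to-set similarity is a mean of per-token maxima, the temperature of 0.05 is arbitrary, and the three losses are summed with equal weight. All function names (`chunk_pool`, `set_sim`, `info_nce`, `hierarchical_loss`) are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def chunk_pool(tokens, size):
    """Assumption: fixed-size mean pooling replaces adaptive clustering."""
    return [mean_pool(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def set_sim(a_tokens, b_tokens):
    """Assumed set-to-set score: mean over a of the best cosine match in b."""
    a = [normalize(x) for x in a_tokens]
    b = [normalize(x) for x in b_tokens]
    return sum(max(dot(x, y) for y in b) for x in a) / len(a)

def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE on a square matrix whose diagonal holds matched pairs."""
    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / len(rows)
    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(cols))

def hierarchical_loss(videos, texts, group=2, temperature=0.05):
    """Sum of frame-word, clip-phrase, and video-sentence contrastive losses."""
    clips = [chunk_pool(v, group) for v in videos]      # frames -> clips
    phrases = [chunk_pool(t, group) for t in texts]     # words -> phrases
    vids = [[mean_pool(c)] for c in clips]              # clips -> video
    sents = [[mean_pool(p)] for p in phrases]           # phrases -> sentence
    levels = [(videos, texts), (clips, phrases), (vids, sents)]
    loss = 0.0
    for v_level, t_level in levels:
        sim = [[set_sim(v, t) for t in t_level] for v in v_level]
        loss += info_nce(sim, temperature)
    return loss

# Toy batch: two videos (4 frames each) and two captions (4 words each), dim 3.
videos = [
    [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.0, 0.0, 0.0], [0.8, 0.1, 0.1]],
    [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0], [0.1, 0.8, 0.1]],
]
texts = [
    [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.8, 0.0, 0.0], [1.0, 0.1, 0.1]],
    [[0.1, 0.9, 0.0], [0.0, 1.0, 0.1], [0.0, 0.8, 0.0], [0.1, 1.0, 0.1]],
]
matched = hierarchical_loss(videos, texts)
swapped = hierarchical_loss(videos, texts[::-1])  # deliberately mismatched pairs
print(matched < swapped)
```

On this toy batch the correctly paired captions yield a lower loss than the swapped ones, which is the behaviour a contrastive objective of this kind is meant to reward at each granularity.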