Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., between short clips and phrases, or between single frames and words. In this paper, we propose a novel method, named HunYuan\_tvr, to model hierarchical cross-modal interactions by simultaneously exploring video-sentence, clip-phrase, and frame-word relationships. Considering the intrinsic semantic relations between frames, HunYuan\_tvr first performs self-attention to explore frame-wise correlations and adaptively clusters correlated frames into clip-level representations. Then, clip-wise correlations are explored to aggregate the clip representations into a compact one that describes the video globally. In this way, we construct hierarchical video representations at frame-clip-video granularities, and likewise explore word-wise correlations to form word-phrase-sentence embeddings for the text modality. Finally, hierarchical contrastive learning is designed to explore cross-modal relationships,~\emph{i.e.,} frame-word, clip-phrase, and video-sentence, which enables HunYuan\_tvr to achieve comprehensive multi-modal understanding. Further boosted by adaptive label denoising and marginal sample enhancement, HunYuan\_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0\%, 57.8\%, 29.7\%, 52.1\%, and 57.3\% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
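For concreteness, a minimal sketch of the hierarchical objective is given below, assuming a symmetric InfoNCE-style term at each granularity $g \in \{\text{frame-word}, \text{clip-phrase}, \text{video-sentence}\}$; the similarity function $s(\cdot,\cdot)$, temperature $\tau$, and weights $\lambda_g$ are illustrative assumptions rather than the paper's exact notation.
\[
\mathcal{L} = \sum_{g} \lambda_g \, \mathcal{L}_g, \qquad
\mathcal{L}_g = -\frac{1}{2B} \sum_{i=1}^{B} \left[ \log \frac{\exp\!\big(s(v_i^{g}, t_i^{g})/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(s(v_i^{g}, t_j^{g})/\tau\big)} + \log \frac{\exp\!\big(s(v_i^{g}, t_i^{g})/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(s(v_j^{g}, t_i^{g})/\tau\big)} \right],
\]
where $v_i^{g}$ and $t_i^{g}$ denote the visual and textual representations of the $i$-th pair at granularity $g$, and $B$ is the batch size; each granularity contributes one video-to-text and one text-to-video contrastive term.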