HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Video-text retrieval is an important yet challenging task in vision-language understanding. It aims to learn a joint embedding space in which related video and text instances lie close to each other. Most current works measure video-text similarity using only video-level and text-level embeddings, and neglecting finer-grained, local information leaves the representation insufficient. Some works exploit local details by disentangling sentences but do not decompose the corresponding videos in the same way, which makes the video-text representation asymmetric. To address these limitations, we propose a Hierarchical Alignment Network (HANet) that aligns representations at different levels for video-text matching. Specifically, we first decompose video and text into three semantic levels: event (video and text), action (motion and verb), and entity (appearance and noun). Based on these, we construct hierarchical representations in an individual-local-global manner: the individual level focuses on the alignment between frame and word, the local level on the alignment between video clip and textual context, and the global level on the alignment between the whole video and text. Alignments at different levels capture fine-to-coarse correlations between video and text and take advantage of the complementary information among the three semantic levels. HANet is also richly interpretable, since it explicitly learns key semantic concepts. Extensive experiments on two public datasets, MSR-VTT and VATEX, show that HANet outperforms other state-of-the-art methods, which demonstrates the effectiveness of hierarchical representation and alignment. Our code is publicly available.
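To make the individual-local-global idea concrete, the following is a minimal sketch of how similarities at the three levels might be computed and combined in a shared embedding space. It is not the authors' implementation: the function names, the dictionary layout (`frames`/`clips`/`global` for video, `words`/`contexts`/`global` for text), the max-over-parts matching rule, and the equal weighting of the three levels are all assumptions chosen for illustration, and the embeddings are assumed to be produced elsewhere by learned encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def part_alignment(video_parts, text_parts):
    """Assumed matching rule: each text part is scored against its best-matching
    video part, and the scores are averaged over the text parts."""
    if not video_parts or not text_parts:
        return 0.0
    best = [max(cosine(v, t) for v in video_parts) for t in text_parts]
    return sum(best) / len(best)


def hierarchical_similarity(video, text, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine individual (frame-word), local (clip-context), and global
    (video-text) similarities. Equal weights are an illustrative assumption."""
    individual = part_alignment(video["frames"], text["words"])
    local = part_alignment(video["clips"], text["contexts"])
    global_sim = cosine(video["global"], text["global"])
    w_ind, w_loc, w_glob = weights
    return w_ind * individual + w_loc * local + w_glob * global_sim


def rank_texts(video, texts):
    """Return indices of candidate texts sorted from most to least similar."""
    scores = [hierarchical_similarity(video, t) for t in texts]
    return sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings (hypothetical values, for illustration only).
video = {"frames": [[1.0, 0.0], [0.0, 1.0]], "clips": [[1.0, 1.0]], "global": [1.0, 1.0]}
matching_text = {"words": [[1.0, 0.0], [0.0, 1.0]], "contexts": [[1.0, 1.0]], "global": [1.0, 1.0]}
unrelated_text = {"words": [[-1.0, 0.0]], "contexts": [[-1.0, -1.0]], "global": [-1.0, -1.0]}

ranking = rank_texts(video, [unrelated_text, matching_text])
```

In this toy setup the matching text agrees with the video at every level, so it ranks ahead of the unrelated one. A trained system would learn the encoders and the way the levels are combined rather than fixing them by hand.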
