Video-text retrieval has become a hot research topic with the explosion of multimedia data on the Internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) limited exploitation of the transformer architecture, in which different layers have different feature characteristics, and 2) an end-to-end training mechanism that limits negative-sample interactions within a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs hierarchical cross-modal contrastive matching at both the feature level and the semantic level to achieve multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning, which enables large-scale negative interactions on the fly and contributes to the generation of more precise and discriminative representations. Experimental results on three major video-text retrieval benchmark datasets demonstrate the advantages of our method.
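The MoCo-inspired mechanism mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard MoCo recipe of a momentum-updated key encoder plus a fixed-size FIFO queue of encoded negatives (the names `momentum_update` and `NegativeQueue` are hypothetical), using NumPy arrays in place of encoder outputs.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    # MoCo-style update: the key encoder's parameters track the
    # query encoder's via an exponential moving average, so queued
    # negatives stay consistent across training steps.
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class NegativeQueue:
    """Fixed-size FIFO queue of encoded keys, serving as a large pool
    of negatives that is decoupled from the mini-batch size."""

    def __init__(self, dim, size):
        # Initialize with random unit vectors as placeholder negatives.
        self.queue = np.random.randn(size, dim)
        self.queue /= np.linalg.norm(self.queue, axis=1, keepdims=True)
        self.ptr = 0
        self.size = size

    def enqueue(self, keys):
        # Overwrite the oldest entries with the newest encoded keys,
        # wrapping around when the end of the queue is reached.
        n = keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.size
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % self.size

def contrastive_logits(queries, keys, queue, temperature=0.07):
    # One positive logit per query (its paired key) followed by
    # similarity logits against every queued negative.
    l_pos = np.sum(queries * keys, axis=1, keepdims=True)
    l_neg = queries @ queue.queue.T
    return np.concatenate([l_pos, l_neg], axis=1) / temperature
```

In a cross-modal setting, each modality (video or text) would maintain its own queue, and each query is contrasted against the paired sample from the other modality plus the queued negatives, which is what allows negative interactions far beyond the mini-batch.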