The task of text-video retrieval, which aims to understand the correspondence between language and vision, has gained increasing attention in recent years. Previous studies either adopt off-the-shelf 2D/3D-CNNs followed by average/max pooling to directly capture spatial features with aggregated temporal information as global video embeddings, or introduce graph-based models and expert knowledge to learn local spatio-temporal relations. However, existing methods have two limitations: 1) The global video representations learn video temporal information through simple average/max pooling and do not fully explore the temporal information between every two frames. 2) The graph-based local video representations are handcrafted and depend heavily on expert knowledge and empirical feedback, which may prevent them from effectively mining higher-level, fine-grained visual relations. These limitations make it difficult to distinguish videos that share the same visual components but differ in their relations. To solve this problem, we propose a novel cross-modal retrieval framework, Bi-Branch Complementary Network (BiC-Net), which modifies the transformer architecture to effectively bridge text-video modalities in a complementary manner by combining local spatio-temporal relations and global temporal information. Specifically, local video representations are encoded using multiple transformer blocks and additional residual blocks to learn spatio-temporal relation features; we call this module the Spatio-Temporal Residual Transformer (SRT). Meanwhile, global video representations are encoded using a multi-layer transformer block to learn global temporal features. Finally, we align the spatio-temporal relation features and the global temporal features with the text feature in two embedding spaces for cross-modal text-video retrieval.
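To make the bi-branch design concrete, the following is a minimal, hypothetical PyTorch sketch of a video encoder with an SRT-like local branch (transformer blocks plus residual blocks over region features) and a global branch (a multi-layer transformer over frame features). All module names, dimensions, and the mean-pooling readout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Assumed residual MLP block appended after the local transformer."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.fc2(self.act(self.fc1(x)))


class BiBranchVideoEncoder(nn.Module):
    """Sketch of two complementary branches:
    - local branch (SRT-like): transformer + residual blocks over region
      features to model spatio-temporal relations;
    - global branch: multi-layer transformer over frame features to model
      global temporal information."""
    def __init__(self, dim=512, heads=8, local_layers=2, global_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.local_transformer = nn.TransformerEncoder(layer, num_layers=local_layers)
        self.local_residual = ResidualBlock(dim)
        self.global_transformer = nn.TransformerEncoder(layer, num_layers=global_layers)

    def forward(self, region_feats, frame_feats):
        # region_feats: (B, num_regions, dim); frame_feats: (B, num_frames, dim)
        local = self.local_residual(self.local_transformer(region_feats)).mean(dim=1)
        global_ = self.global_transformer(frame_feats).mean(dim=1)
        # Each branch output would then be aligned with the text embedding
        # in its own embedding space (e.g., via cosine similarity).
        return local, global_
```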