Aligning objects with words plays a critical role in Image-Language BERT (IL-BERT) and Video-Language BERT (VDL-BERT). Unlike in the image case, where an object covers some spatial patches, an object in a video usually appears as an object trajectory, i.e., it spans a few spatial patches but a longer range of temporal patches, and thus contains abundant spatiotemporal context. However, modern VDL-BERTs neglect this trajectory characteristic: they usually follow IL-BERTs in deploying patch-to-word (P2W) attention, which may over-exploit trivial spatial contexts and neglect significant temporal contexts. To amend this, we propose a novel TW-BERT that learns Trajectory-Word alignment for solving video-language tasks. This alignment is learned by a newly designed trajectory-to-word (T2W) attention. Besides T2W attention, we also follow previous VDL-BERTs in placing a word-to-patch (W2P) attention in the cross-modal encoder. Since the T2W and W2P attentions have different structures, our cross-modal encoder is asymmetric. To further help this asymmetric cross-modal encoder build robust vision-language associations, we propose a fine-grained ``align-before-fuse'' strategy that pulls together the embedding spaces computed by the video and text encoders. With the proposed strategy and T2W attention, TW-BERT achieves SOTA performance on text-to-video retrieval tasks, and performance on video question answering tasks comparable to that of some VDL-BERTs trained on much more data. The code will be available in the supplementary material.
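The contrast between P2W and T2W attention described above can be illustrated with a minimal NumPy sketch. It is not the TW-BERT implementation: the abstract does not say how trajectories are constructed, so the sketch assumes, purely for illustration, that a trajectory token is the temporal average of the patch embeddings at one spatial position. The function `cross_attention`, the tensor sizes, and the random embeddings are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights


rng = np.random.default_rng(0)
T, S, N, D = 4, 9, 6, 16  # frames, patches per frame, words, embedding dim (all assumed)
patches = rng.standard_normal((T, S, D))  # stand-in video patch embeddings
words = rng.standard_normal((N, D))       # stand-in word embeddings

# P2W: every patch of every frame attends to the words independently,
# so the same object is queried T times with no temporal grouping.
p2w_out, p2w_weights = cross_attention(patches.reshape(T * S, D), words, words)

# T2W (illustrative assumption): group patches into one token per spatial
# position by averaging over time, then let each trajectory token attend
# to the words once.
trajectories = patches.mean(axis=0)  # shape (S, D)
t2w_out, t2w_weights = cross_attention(trajectories, words, words)

print(p2w_weights.shape, t2w_weights.shape)
```

In this toy setting the P2W attention map has one row per patch per frame, while the T2W map has one row per trajectory, which is the structural difference the abstract attributes to the two attentions.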