We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and languages from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework that first co-learns image-text representations and then enhances temporal relations both across video frames and between video and text, making it possible to train on comparatively small datasets. Specifically, on top of the spatial semantics captured by the Contrastive Language-Image Pre-training (CLIP) model, our model involves a Temporal Difference Block to capture motion at a fine-grained temporal level of video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD, and VATEX.
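To make the two-block design above concrete, the following is a minimal PyTorch sketch of how a temporal-difference module and a video-text alignment module could sit on top of per-frame CLIP features. All module names, dimensions, and fusion choices here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the two blocks described in the abstract (illustrative only;
# not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDifferenceBlock(nn.Module):
    """Injects frame-difference (motion) cues into per-frame CLIP features."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # project adjacent-frame differences
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal_enc = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) features from the CLIP image encoder
        diff = frame_feats[:, 1:] - frame_feats[:, :-1]   # adjacent-frame motion cues
        diff = F.pad(diff, (0, 0, 0, 1))                  # pad to keep sequence length
        fused = frame_feats + self.proj(diff)             # add motion cue to each frame token
        return self.temporal_enc(fused)                   # model temporal relations


class TemporalAlignmentBlock(nn.Module):
    """Aligns video tokens and text tokens in a shared space for retrieval."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, num_frames, dim); text_tokens: (batch, num_words, dim)
        v = F.normalize(self.video_proj(video_tokens).mean(dim=1), dim=-1)
        t = F.normalize(self.text_proj(text_tokens).mean(dim=1), dim=-1)
        return v @ t.T  # (batch, batch) video-text similarity matrix


if __name__ == "__main__":
    frames = torch.randn(4, 12, 512)   # e.g., 12 sampled frames per video
    words = torch.randn(4, 20, 512)    # e.g., 20 caption tokens per sentence
    tdb, tab = TemporalDifferenceBlock(), TemporalAlignmentBlock()
    sim = tab(tdb(frames), words)
    print(sim.shape)                   # torch.Size([4, 4])
```

In this sketch the similarity matrix would be trained with a standard symmetric contrastive loss over matched video-text pairs; the actual token re-alignment strategy in the Temporal Alignment Block is more involved than the mean pooling shown here.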