We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and languages from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework that first co-learns image-text representations and then enhances temporal relations both across video frames and between video and text, making it possible to train on comparatively small datasets. Specifically, on top of the spatial semantics captured by the Contrastive Language-Image Pre-training (CLIP) model, our model involves a Temporal Difference Block to capture motion at a fine-grained temporal level of video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD, and VATEX.
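To make the two-block design above concrete, the following is a minimal PyTorch sketch of how a temporal-difference module and a video-text alignment module could sit on top of per-frame CLIP features. All module names, dimensions, and fusion choices here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the two blocks described in the abstract (illustrative only;
# not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDifferenceBlock(nn.Module):
    """Injects frame-difference (motion) cues into per-frame CLIP features."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # project adjacent-frame differences
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal_enc = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) features from the CLIP image encoder
        diff = frame_feats[:, 1:] - frame_feats[:, :-1]   # adjacent-frame motion cues
        diff = F.pad(diff, (0, 0, 0, 1))                  # pad to keep sequence length
        fused = frame_feats + self.proj(diff)             # add motion cue to each frame token
        return self.temporal_enc(fused)                   # model temporal relations


class TemporalAlignmentBlock(nn.Module):
    """Aligns video tokens and text tokens in a shared space for retrieval."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, num_frames, dim); text_tokens: (batch, num_words, dim)
        v = F.normalize(self.video_proj(video_tokens).mean(dim=1), dim=-1)
        t = F.normalize(self.text_proj(text_tokens).mean(dim=1), dim=-1)
        return v @ t.T  # (batch, batch) video-text similarity matrix


if __name__ == "__main__":
    frames = torch.randn(4, 12, 512)   # e.g., 12 sampled frames per video
    words = torch.randn(4, 20, 512)    # e.g., 20 caption tokens per sentence
    tdb, tab = TemporalDifferenceBlock(), TemporalAlignmentBlock()
    sim = tab(tdb(frames), words)
    print(sim.shape)                   # torch.Size([4, 4])
```

In this sketch the similarity matrix would be trained with a standard symmetric contrastive loss over matched video-text pairs; the actual token re-alignment strategy in the Temporal Alignment Block is more involved than the mean pooling shown here.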