Contrastive language-image pretraining has shown great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information to generate discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://aka.ms/X-CLIP
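To make the idea of a cross-frame attention module concrete, below is a minimal PyTorch sketch, not the paper's implementation: the class name `CrossFrameAttention`, the embedding size, and the choice of one [CLS]-style token per frame are assumptions for illustration. The point it shows is that per-frame tokens attend to one another along the temporal dimension, and a residual connection lets such a block be inserted into a pretrained image encoder without disturbing its original features.

```python
# Illustrative sketch of cross-frame attention (assumed names; not the official X-CLIP code).
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Lets per-frame tokens exchange information across the temporal dimension."""
    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, embed_dim), e.g. one [CLS] token per frame
        x = self.norm(frame_tokens)
        out, _ = self.attn(x, x, x)   # every frame attends to every other frame
        return frame_tokens + out     # residual keeps the pretrained per-frame features intact

# Usage: a batch of 2 clips, 8 frames each, CLIP ViT-B-sized embeddings
tokens = torch.randn(2, 8, 512)
tokens = CrossFrameAttention()(tokens)
print(tokens.shape)  # torch.Size([2, 8, 512])
```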