Vision-Language models have shown strong performance in the image domain -- even in zero-shot settings, thanks to the availability of large amounts of pretraining data (i.e., paired image-text examples). However, for videos, such paired data is not as abundant. Thus, video-text models are usually designed by adapting pretrained image-text models to the video domain, rather than training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image -> video), often keeping the text embeddings unchanged or even discarding them. In this paper, we argue that such adapted video-text models can benefit more from augmenting text rather than visual information. We propose VicTR, which jointly optimizes text and video tokens, generating 'Video-conditioned Text' embeddings. Our method can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g., object or scene information). We conduct experiments on multiple benchmarks, including supervised (Kinetics-400, Charades) and zero-shot/few-shot (HMDB-51, UCF-101) settings, showing competitive performance of video-text models on activity recognition.
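To make the core idea concrete, below is a minimal sketch of what "video-conditioned text" embeddings could look like. This is not the authors' implementation: the module name `VideoConditionedText`, the use of cross-attention, and all hyperparameters are assumptions for illustration. It conditions text embeddings (class prompts plus auxiliary text such as object or scene descriptions) on per-frame video embeddings, then scores classes by cosine similarity against a pooled video representation.

```python
# Illustrative sketch only; module name, cross-attention design, and
# hyperparameters are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoConditionedText(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Text tokens attend over video (frame) tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:  (B, num_texts, D)  -- class + auxiliary text embeddings
        # video_emb: (B, num_frames, D) -- per-frame visual embeddings
        attended, _ = self.cross_attn(query=text_emb, key=video_emb, value=video_emb)
        return self.norm(text_emb + attended)  # video-conditioned text embeddings

# Toy usage: score 4 class prompts against a pooled video representation.
B, T, C, D = 2, 8, 4, 512
video = torch.randn(B, T, D)   # e.g., frame features from an image encoder
texts = torch.randn(B, C, D)   # e.g., class-prompt features from a text encoder
model = VideoConditionedText(D)
cond_text = model(texts, video)                        # (B, C, D)
video_pooled = F.normalize(video.mean(dim=1), dim=-1)  # (B, D)
logits = torch.einsum('bcd,bd->bc', F.normalize(cond_text, dim=-1), video_pooled)
```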