This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can thus perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language pretraining to help video-language tasks). To this end, we propose a decoupled joint pretraining of image-language and video-language that effectively decomposes vision-language modeling into spatial and temporal dimensions, boosting performance on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
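For intuition, below is a minimal sketch of how a contrastive loss unifying paired (image/video-text) and labeled (image/video-label) data could look. It assumes a PyTorch setting; the function name `univlc_loss`, the `target_ids` convention (unique ids for web caption pairs, shared ids for samples of the same class), and the soft-target formulation are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def univlc_loss(visual_emb: torch.Tensor,
                text_emb: torch.Tensor,
                target_ids: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Sketch of a unified vision-language contrastive loss.

    visual_emb: (N, D) L2-normalized image/video embeddings.
    text_emb:   (N, D) L2-normalized caption or label-prompt embeddings.
    target_ids: (N,) integer targets; web image/video-text pairs get
                unique ids, while labeled samples of the same class
                share one id, so all same-class pairs count as positives.
    """
    logits = visual_emb @ text_emb.t() / temperature            # (N, N) similarities
    # Pair (i, j) is positive when targets match; the mask is symmetric
    # and always has a positive on the diagonal (each sample matches itself).
    pos = (target_ids[:, None] == target_ids[None, :]).float()
    targets = pos / pos.sum(dim=1, keepdim=True)                # soft labels per row
    # Bidirectional InfoNCE with soft targets; since pos is symmetric,
    # the same row-normalized targets apply to the text-to-visual direction.
    loss_v2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2v = -(targets * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

One common way to realize the shared embedding space this sketch assumes is to encode class labels as text (e.g., prompted label names fed through the same text encoder), which lets supervised label data and noisy web caption data contribute to a single contrastive objective.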