This work explores an efficient approach to establishing a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning, and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to ``flattened frame embeddings'', yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates \(N\) token embeddings per frame, for a total of \(T\) video frames. We flatten the \(N \times T\) token embeddings into a long sequence of frozen video representations and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights, including the pooling layers, are loaded directly from a pretrained image-text CoCa model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, YouCook2). Our approach establishes a simple and effective video-text baseline for future research.
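The flattening-plus-pooling mechanism described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the toy sizes, random weights, and single-head attention are assumptions made purely to show the data flow: per-frame token embeddings from a frozen image encoder are reshaped into one long sequence, and a small set of learned queries attends over it, exactly as CoCa's attentional poolers would.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries, Wq, Wk, Wv):
    """Single-head cross-attention pooling (toy sketch).

    tokens:  (L, d) flattened frame embeddings
    queries: (M, d) learned query vectors
    """
    q = queries @ Wq
    k = tokens @ Wk
    v = tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v  # (M, d): pooled video representation

# Toy sizes (assumptions, not the paper's configuration)
T, N, d = 8, 16, 4  # T frames, N tokens per frame, embedding dim d
rng = np.random.default_rng(0)

# Frozen image-encoder outputs: one (N, d) grid per frame
frame_tokens = rng.standard_normal((T, N, d))

# Flatten across time into a single (N*T, d) token sequence
flattened = frame_tokens.reshape(T * N, d)

# M learned queries (e.g. one contrastive query; CoCa's generative
# pooler uses many more)
M = 2
queries = rng.standard_normal((M, d))
Wq = Wk = Wv = np.eye(d)  # placeholder projections

pooled = attentional_pool(flattened, queries, Wq, Wk, Wv)
print(pooled.shape)  # (2, 4)
```

The key point the sketch captures is that nothing in the pooling layers depends on the sequence length, which is why the pretrained image-text poolers transfer to the longer \(N \times T\) video sequence without architectural changes.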