We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning.
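To make the "flattened frame embeddings" idea concrete, the following is a minimal sketch (not the authors' code) of CoCa-style attentional pooling applied to per-frame tokens that are simply concatenated along the sequence axis; all names, shapes, and dimensions here are illustrative assumptions.

```python
# Hypothetical sketch of attentional pooling over flattened frame embeddings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries, wq, wk, wv):
    """Single-head cross-attention pooling: learned queries attend to all tokens."""
    q = queries @ wq                                  # (n_queries, d)
    k = tokens @ wk                                   # (n_tokens, d)
    v = tokens @ wv                                   # (n_tokens, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (n_queries, n_tokens)
    return attn @ v                                   # (n_queries, d)

rng = np.random.default_rng(0)

# Per-frame token embeddings from a (frozen) image encoder: (frames, tokens, d).
frames, tokens_per_frame, d = 8, 196, 64
frame_embs = rng.normal(size=(frames, tokens_per_frame, d))

# "Flattening" concatenates all frame tokens into one long sequence, so the
# pretrained poolers can consume video without any cross-frame fusion module.
flat = frame_embs.reshape(frames * tokens_per_frame, d)

# Contrastive pooler: a single query yields one video-level embedding.
# Generative pooler: multiple queries yield a token sequence for the text decoder.
contrastive_query = rng.normal(size=(1, d))
generative_queries = rng.normal(size=(256, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))

video_embedding = attentional_pool(flat, contrastive_query, wq, wk, wv)   # (1, d)
decoder_tokens = attentional_pool(flat, generative_queries, wq, wk, wv)   # (256, d)
print(video_embedding.shape, decoder_tokens.shape)
```

Because the pooling queries and projections come from the pretrained image-text model, only the input sequence length changes when moving from a single image to a flattened video, which is why little or no extra training is needed for the zero-shot settings described above.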