Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without the necessity of pretraining or finetuning on any video datasets. We use image-language models to translate the video content into frame captions and object, attribute, and event phrases, and compose them into a temporal structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to capture any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. In particular, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and resources are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL.
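The prompt-composition step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `VideoDescription` container, the field labels, the ordering words in `ORDER_WORDS`, and the helper names are all hypothetical, and the frame captions, object, attribute, and event phrases are assumed to have already been produced by an image-language model. The sketch only shows how such text could be arranged into a temporally ordered template and combined with a few in-context examples.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical ordering words; the paper's actual template wording may differ.
ORDER_WORDS = ["First", "Then", "After that", "Finally"]


@dataclass
class VideoDescription:
    """Text extracted from one video (assumed output of an image-language model)."""
    frame_captions: List[str]      # one caption per sampled frame, in temporal order
    objects: List[str]
    attributes: List[str]
    events: List[str]
    asr: str = ""                  # optional speech transcript
    target: Optional[str] = None   # known output; set only for in-context examples


def temporal_marker(index: int, total: int) -> str:
    """Choose an ordering word for the frame at position `index` out of `total`."""
    if index == 0:
        return ORDER_WORDS[0]
    if index == total - 1:
        return ORDER_WORDS[-1]
    return ORDER_WORDS[1] if index == 1 else ORDER_WORDS[2]


def render_video(desc: VideoDescription, task_label: str = "Caption") -> str:
    """Render one video as a text block; the target slot stays open for the query."""
    lines = []
    for label, phrases in (("Objects", desc.objects),
                           ("Attributes", desc.attributes),
                           ("Events", desc.events)):
        if phrases:
            lines.append(f"{label}: " + ", ".join(phrases))
    total = len(desc.frame_captions)
    ordered = " ".join(
        f"{temporal_marker(i, total)}, {caption.rstrip('.')}."
        for i, caption in enumerate(desc.frame_captions)
    )
    lines.append("Frame captions: " + ordered)
    if desc.asr:
        lines.append("Subtitles: " + desc.asr)
    lines.append(f"{task_label}:" + (f" {desc.target}" if desc.target else ""))
    return "\n".join(lines)


def build_prompt(instruction: str,
                 examples: List[VideoDescription],
                 query: VideoDescription) -> str:
    """Concatenate the instruction, the few-shot examples, and the open query."""
    blocks = [instruction]
    blocks += [render_video(example) for example in examples]
    blocks.append(render_video(query))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    example = VideoDescription(
        frame_captions=["a dog runs on grass", "a dog catches a ball"],
        objects=["dog", "ball"], attributes=["brown"], events=["playing fetch"],
        target="A dog plays fetch in a park.",
    )
    query = VideoDescription(
        frame_captions=["a man picks up a knife", "a man cuts an onion"],
        objects=["knife", "onion"], attributes=[], events=["cooking"],
    )
    print(build_prompt("Describe the video in one sentence.", [example], query))
```

The resulting string ends with an open `Caption:` slot, so any text-completion language model could be asked to continue it. Other tasks such as question answering would change only the instruction and the task label, which is the flexibility the abstract attributes to prompting.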
