Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective -- we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
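As a minimal sketch of the bidirectional generation objective described above (the notation is our assumption; the abstract itself gives no equations), the loss can be read as the sum of two caption-generation terms, where $V_t, U_t$ denote the present video frames and transcribed utterance, $V_{t+}, U_{t+}$ the future ones, and $\theta$ the shared encoder-decoder parameters:

% Hypothetical notation, not taken from the paper:
% forward term generates the future utterance from the present multimodal context,
% backward term generates the present utterance from future observations.
\[
\mathcal{L}(\theta) \;=\;
\underbrace{-\log p_\theta\!\left(U_{t+} \mid V_t, U_t\right)}_{\text{forward generation}}
\;+\;
\underbrace{-\log p_\theta\!\left(U_t \mid V_{t+}, U_{t+}\right)}_{\text{backward generation}}
\]

Both terms are assumed to be standard autoregressive cross-entropy losses over decoder tokens, so the same encoder-decoder model is trained end-to-end on unlabelled video without any manually written captions.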