As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations of videos through all constituent modalities. When finetuned, it sets a new state of the art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing the ethical and societal implications of multimodal pretraining.
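To make the described objective concrete, below is a minimal, self-contained sketch of a contrastive masked-snippet loss in the spirit of the abstract: the model's output at a MASK position is scored against a pool of candidate snippet embeddings (the true masked-out snippet plus distractors), and the loss encourages picking the correct one. All function names, shapes, and the temperature value are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a contrastive masked-snippet objective (not the paper's code).
import numpy as np

def masked_snippet_loss(mask_pred, candidate_embs, target_idx, temperature=0.05):
    """Score a MASK-position prediction against candidate snippet embeddings.

    mask_pred:       (d,)   model output at the MASK position
    candidate_embs:  (k, d) embeddings of the true snippet plus k-1 distractors
    target_idx:      index of the true (masked-out) snippet among the candidates
    """
    # L2-normalize so the scores are cosine similarities.
    pred = mask_pred / np.linalg.norm(mask_pred)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)

    logits = cands @ pred / temperature              # (k,) similarity scores
    logits -= logits.max()                           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]                    # cross-entropy: pick the true snippet

# Toy usage: 1 "correct" snippet plus 7 distractors in a 16-dim embedding space.
rng = np.random.default_rng(0)
pred = rng.normal(size=16)
cands = rng.normal(size=(8, 16))
cands[3] = pred + 0.1 * rng.normal(size=16)          # make candidate 3 the true snippet
print(masked_snippet_loss(pred, cands, target_idx=3))
```

Framing the objective as selection among candidates (rather than regenerating the masked audio or text) is what lets the same setup cover both modalities; the sketch above only illustrates that selection step under the stated assumptions.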