As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time, through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5%, respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
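To make the masked-snippet objective described above concrete, the sketch below shows one way such a contrastive selection loss can be written: the joint encoder's hidden state at each MASK position must pick out the representation of the snippet that was removed, with the other snippets in the batch serving as negatives. This is a minimal illustration under assumptions of our own (the function name, the use of in-batch negatives, and the temperature value are hypothetical), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_span_objective(joint_reprs, target_reprs, temperature=0.05):
    """Hypothetical sketch of a masked-snippet contrastive loss.

    joint_reprs:  (B, D) hidden states at MASKed positions from a joint encoder.
    target_reprs: (B, D) independently encoded representations of the true
                  masked-out snippets (e.g. text or audio); the remaining rows
                  in the batch act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    joint_reprs = F.normalize(joint_reprs, dim=-1)
    target_reprs = F.normalize(target_reprs, dim=-1)

    # Similarity of every MASK position against every candidate snippet.
    logits = joint_reprs @ target_reprs.t() / temperature  # (B, B)

    # Each MASK position should select its own masked-out snippet (the diagonal).
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

In this reading, "choosing the correct masked-out snippet" reduces to a cross-entropy over similarity scores, which is why the objective scales naturally with batch size and pretraining data.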