As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.
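To make the frame-level objective concrete, below is a minimal illustrative sketch (not the authors' code) of a contrastive frame-transcript matching loss of the kind described above, in which video frame features and transcript-segment features that co-occur in time are pulled together. The function name, feature dimensions, and temperature value are all assumptions for illustration only.

```python
# Illustrative sketch of contrastive frame-transcript matching (assumed details,
# not the MERLOT implementation): temporally aligned (frame, transcript-segment)
# feature pairs are treated as positives, all other pairings as negatives.
import torch
import torch.nn.functional as F

def frame_transcript_contrastive_loss(frame_feats, text_feats, temperature=0.05):
    """frame_feats, text_feats: [N, D] features for N temporally aligned pairs."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = frame_feats @ text_feats.t() / temperature  # [N, N] similarity matrix
    targets = torch.arange(frame_feats.size(0), device=frame_feats.device)
    # Symmetric InfoNCE: each frame should match its own transcript segment, and vice versa.
    loss_f2t = F.cross_entropy(logits, targets)
    loss_t2f = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_f2t + loss_t2f)

if __name__ == "__main__":
    # Example usage with random features for 8 aligned pairs of dimension 512.
    frames = torch.randn(8, 512)
    texts = torch.randn(8, 512)
    print(frame_transcript_contrastive_loss(frames, texts).item())
```

The video-level (temporal) objectives described in the abstract, such as reasoning about ordering across frames, would operate on top of such contextualized representations rather than on individual frame-text pairs.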