Pre-training on large-scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, because speech is used only to supervise the pre-training, it is never seen by the video encoder, which therefore does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
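To make the masking objective concrete, below is a minimal sketch of modality-masking pre-training in PyTorch. Everything here is an illustrative assumption rather than the paper's actual design: the class name, the learnable per-modality mask tokens, the shared Transformer encoder, and the simple L2 reconstruction loss stand in for whatever architecture and objective the paper uses.

```python
# Minimal sketch of "modality masking" pre-training (assumed design, not the
# paper's exact architecture): hide one whole modality, predict it from the
# other two, so the encoder must learn to process all three.
import torch
import torch.nn as nn

MODALITIES = ("appearance", "audio", "speech")

class ModalityMaskingPretrainer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # One learnable "mask token" per modality, substituted for the
        # masked modality's features at the encoder input.
        self.mask_tokens = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, dim)) for m in MODALITIES}
        )
        # Shared multimodal encoder over the concatenated token sequences.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Per-modality heads that reconstruct the masked modality's features.
        self.heads = nn.ModuleDict({m: nn.Linear(dim, dim) for m in MODALITIES})

    def forward(self, feats, masked):
        # feats: dict of modality name -> (batch, tokens, dim) features,
        # assumed to come from frozen per-modality feature extractors.
        # masked: name of the modality to hide and reconstruct.
        target = feats[masked]
        b, t, d = target.shape
        inputs = []
        for m in MODALITIES:
            if m == masked:
                # Replace the entire modality with its mask token.
                inputs.append(self.mask_tokens[m].expand(b, t, d))
            else:
                inputs.append(feats[m])
        encoded = self.encoder(torch.cat(inputs, dim=1))
        # Read out the masked modality's positions and regress its features.
        start = MODALITIES.index(masked) * t
        pred = self.heads[masked](encoded[:, start:start + t])
        return nn.functional.mse_loss(pred, target)

# Toy usage: each step masks one modality in turn, forcing the remaining two
# to collaborate in predicting it.
model = ModalityMaskingPretrainer()
feats = {m: torch.randn(2, 8, 256) for m in MODALITIES}
loss = sum(model(feats, m) for m in MODALITIES)
loss.backward()
```

For simplicity the sketch gives every modality the same number of tokens and the same feature dimension; in practice each modality would have its own tokenizer and projection before the shared encoder.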