In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. The masked signal reconstruction of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training examples. Moreover, our method outperforms competitors by a significant margin in limited-data scenarios.
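To make the joint objective concrete, below is a minimal sketch of cross-modal masked reconstruction in PyTorch. All names, dimensions, mask ratios, and the single cross-attention layer (standing in for a full multi-modal transformer) are illustrative assumptions, not the paper's actual architecture: masked text attends to the image to predict masked word ids (MLM), and masked patches attend to the text to regress the original patch signals (MIM).

```python
import torch
import torch.nn as nn

class JointMaskedVLM(nn.Module):
    """Sketch: reconstruct the masked signal of one modality
    conditioned on the other modality via cross-attention.
    (Hypothetical module; dims and heads are placeholders.)"""

    def __init__(self, dim=256, vocab_size=30522, patch_dim=768, num_heads=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_proj = nn.Linear(patch_dim, dim)
        self.t2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2t_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlm_head = nn.Linear(dim, vocab_size)  # predicts masked word ids
        self.mim_head = nn.Linear(dim, patch_dim)   # regresses masked patch signals

    def forward(self, text_ids, image_patches):
        t = self.text_embed(text_ids)               # (B, Lt, dim)
        v = self.patch_proj(image_patches)          # (B, Lv, dim)
        # Masked text queries the full image, and vice versa.
        t_ctx, _ = self.t2v_attn(t, v, v)
        v_ctx, _ = self.v2t_attn(v, t, t)
        return self.mlm_head(t_ctx), self.mim_head(v_ctx)

# Toy usage: mask 15% of text tokens and 50% of patches, then compute
# losses only at the masked positions (ratios are assumptions).
torch.manual_seed(0)
model = JointMaskedVLM()
text_ids = torch.randint(0, 30522, (2, 16))
patches = torch.randn(2, 49, 768)

text_mask = torch.rand(2, 16) < 0.15
patch_mask = torch.rand(2, 49) < 0.50
masked_ids = text_ids.masked_fill(text_mask, 103)  # 103 ~ BERT [MASK] id
masked_patches = patches.masked_fill(patch_mask.unsqueeze(-1), 0.0)

logits, recon = model(masked_ids, masked_patches)
mlm_loss = nn.functional.cross_entropy(logits[text_mask], text_ids[text_mask])
mim_loss = nn.functional.mse_loss(recon[patch_mask], patches[patch_mask])
loss = mlm_loss + mim_loss  # jointly trained; alignment losses would be added here
```

The key design point this sketch illustrates is that each reconstruction head sees the *other* modality as context, so minimizing the two masked losses implicitly pulls corresponding language tokens and image patches together.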