In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose joint masked vision and language modeling, in which the masked signal of one modality is reconstructed with the help of the other modality. This approach is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other also implicitly learns cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method not only achieves state-of-the-art performance when a large amount of training data is available, but also outperforms competing methods by a significant margin in limited-data regimes.
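To make the idea concrete, below is a minimal, self-contained sketch of joint masked vision-and-language modeling. It is not the paper's implementation: all module names (`JointMaskedVLM`, `CrossModalReconstructor`), dimensions, mask ratios, and loss choices (cross-entropy for masked text tokens, regression for masked image patches) are illustrative assumptions. The sketch only shows the core mechanism the abstract describes: each modality's masked stream is reconstructed by cross-attending to the representation of the other modality, and both reconstruction losses are optimized jointly.

```python
# Minimal sketch (NOT the authors' implementation) of joint masked V+L modeling.
# Assumed names, dimensions, and mask ratio are illustrative only.
import torch
import torch.nn as nn

class CrossModalReconstructor(nn.Module):
    """One decoder block: self-attention over the masked stream, then
    cross-attention into the other modality, then a feed-forward layer."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, context):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), context, context)[0]
        return x + self.ffn(self.norm3(x))

class JointMaskedVLM(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=768, dim=256, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.txt_embed = nn.Embedding(vocab_size, dim)
        self.img_embed = nn.Linear(patch_dim, dim)
        self.txt_mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.img_mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.txt_decoder = CrossModalReconstructor(dim)
        self.img_decoder = CrossModalReconstructor(dim)
        self.txt_head = nn.Linear(dim, vocab_size)  # predict masked token ids
        self.img_head = nn.Linear(dim, patch_dim)   # regress masked patch signal

    def _mask(self, x, mask_token):
        # Replace a random subset of positions with a learned [MASK] embedding.
        B, N, _ = x.shape
        mask = torch.rand(B, N, device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)
        return x, mask

    def forward(self, token_ids, patches):
        txt = self.txt_embed(token_ids)
        img = self.img_embed(patches)
        txt_in, txt_mask = self._mask(txt, self.txt_mask_token)
        img_in, img_mask = self._mask(img, self.img_mask_token)
        # Each masked stream is reconstructed conditioned on the other
        # modality's representation -- the cross-modal signal.
        txt_rec = self.txt_head(self.txt_decoder(txt_in, img))
        img_rec = self.img_head(self.img_decoder(img_in, txt))
        loss_mlm = nn.functional.cross_entropy(txt_rec[txt_mask],
                                               token_ids[txt_mask])
        loss_mim = nn.functional.mse_loss(img_rec[img_mask], patches[img_mask])
        return loss_mlm + loss_mim

# Toy usage with random data: 16 text tokens, 49 image patches per sample.
model = JointMaskedVLM()
token_ids = torch.randint(0, 30522, (2, 16))
patches = torch.randn(2, 49, 768)
loss = model(token_ids, patches)
loss.backward()
```

Note the design choice this sketch encodes: because each reconstruction head must query the other modality through cross-attention to recover the masked content, the model is pushed to align language tokens with image patches, which is the implicit cross-modal alignment the abstract refers to.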