We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.
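A minimal sketch of the unified masked-prediction setup described above, assuming a toy PyTorch backbone: one shared Transformer encodes text tokens, image patches, or both, and the same parameters serve masked language modeling, masked image modeling, and masked vision-language modeling. All module names, sizes, and heads below are illustrative assumptions rather than the authors' implementation, and the actual masking of input positions is omitted for brevity.

```python
# Sketch only: one shared Transformer backbone, three masked-prediction objectives.
# Module names, vocabulary sizes, and heads are hypothetical, not VL-BEiT's code.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self, vocab_size=30522, visual_vocab_size=8192,
                 dim=768, depth=12, heads=12, num_patches=196, max_text_len=64):
        super().__init__()
        # Text tokens and image patches are embedded into one space and
        # processed by a single shared encoder.
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)        # flattened 16x16 RGB patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + max_text_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)     # shared backbone
        self.text_head = nn.Linear(dim, vocab_size)            # predicts masked words
        self.image_head = nn.Linear(dim, visual_vocab_size)    # predicts masked visual tokens

    def forward(self, text_ids=None, patches=None):
        parts = []
        if patches is not None:
            parts.append(self.patch_embed(patches))
        if text_ids is not None:
            parts.append(self.text_embed(text_ids))
        x = torch.cat(parts, dim=1)
        x = x + self.pos_embed[:, : x.size(1)]
        return self.encoder(x)

model = SharedBackbone()
loss_fn = nn.CrossEntropyLoss()

# Masked language modeling on text-only data: recover token ids at (masked) positions.
text = torch.randint(0, 30522, (2, 64))
h = model(text_ids=text)
mlm_loss = loss_fn(model.text_head(h).flatten(0, 1), text.flatten())

# Masked image modeling on image-only data: predict discrete visual tokens
# (targets would come from an image tokenizer; random ids stand in here).
patches = torch.randn(2, 196, 3 * 16 * 16)
visual_targets = torch.randint(0, 8192, (2, 196))
h = model(patches=patches)
mim_loss = loss_fn(model.image_head(h).flatten(0, 1), visual_targets.flatten())

# Masked vision-language modeling on image-text pairs: the same backbone sees both
# modalities and recovers masked text tokens conditioned on the image.
h = model(text_ids=text, patches=patches)
text_states = h[:, 196:]                                       # text positions only
mvlm_loss = loss_fn(model.text_head(text_states).flatten(0, 1), text.flatten())

total_loss = mlm_loss + mim_loss + mvlm_loss
```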