Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive region features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpora. We build a unified Transformer model to jointly learn visual representations, textual representations, and the semantic alignment between images and texts. In particular, we propose to conduct grounded learning on both images and texts via a shared grounded space, which helps bridge unaligned images and texts and align the visual and textual semantic spaces across different types of corpora. Experiments show that our grounded learning method improves textual and visual semantic alignment, thereby improving performance on various cross-modal tasks. Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive performance on single-modal visual and textual tasks. Our code and models are publicly available at the UNIMO project page https://unimo-ptm.github.io/.
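To make the idea of a shared grounded space concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): the grounded space is assumed to be a dictionary of learnable grounded tokens, and features from either modality are soft-matched against this dictionary so that image-only and text-only inputs are expressed in the same grounded basis. The class name, dictionary size, and matching scheme are illustrative assumptions.

```python
# Illustrative sketch of a shared grounded space; all names and sizes are
# hypothetical assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedGroundedSpace(nn.Module):
    def __init__(self, dim=768, num_grounded_tokens=256):
        super().__init__()
        # Dictionary of grounded tokens shared by both modalities.
        self.grounded_tokens = nn.Parameter(torch.randn(num_grounded_tokens, dim))

    def forward(self, features):
        # features: (batch, seq_len, dim) from either an image or a text encoder.
        # Soft-match each feature to the shared grounded tokens so that visual
        # and textual inputs are re-expressed in the same grounded basis.
        sim = features @ self.grounded_tokens.t()                 # (B, L, K)
        attn = F.softmax(sim / features.size(-1) ** 0.5, dim=-1)  # (B, L, K)
        grounded = attn @ self.grounded_tokens                    # (B, L, dim)
        return grounded


# Usage: grounded sequences from unaligned image-only or text-only batches
# could then be fed, together with the original features, into a unified
# Transformer for joint pre-training.
space = SharedGroundedSpace()
image_feats = torch.randn(2, 49, 768)   # e.g., patch features from an image encoder
text_feats = torch.randn(2, 32, 768)    # e.g., token features from a text encoder
print(space(image_feats).shape, space(text_feats).shape)
```

Because both modalities are matched against the same dictionary, unaligned images and texts can still be related through the grounded tokens, which is the intuition behind bridging the visual and textual semantic spaces described above.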