Existing pre-training methods focus on either single-modal or multi-modal tasks, and cannot effectively adapt to the other. They can only utilize single-modal data (i.e., text or images) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. Since non-paired single-modal data are very rich, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, textual knowledge and visual knowledge can enhance each other in the unified semantic space. Experimental results show that UNIMO significantly improves the performance on several single-modal and multi-modal downstream tasks. Our code and pre-trained models are publicly available at the UNIMO project page https://unimo-ptm.github.io/
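To make the cross-modal contrastive objective concrete, the sketch below shows a generic InfoNCE-style loss over a batch of image-text pairs, where matched pairs are pulled together and in-batch mismatches are pushed apart in the shared semantic space. This is only an illustrative approximation under stated assumptions: the function name `cross_modal_contrastive_loss` and the `temperature` parameter are hypothetical, and the paper's actual CMCL additionally constructs positives and negatives through text rewriting and image/text retrieval, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch_size, dim) tensors whose i-th rows come
    from the same image-text pair (positives); all other rows in the
    batch serve as in-batch negatives.
    """
    # Normalize both modalities so the dot product is a cosine
    # similarity in the unified semantic space.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i][j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for each row sits on the diagonal.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric loss: align images to texts and texts to images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```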