Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They learn cross-modal representations via contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality. In reality, an image or a text admits multiple potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a Multi-View Contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality and learn intra-modal correlations to enhance the single-modal representations. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M image-text pairs from publicly available datasets, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 by scaling the pre-training data up to 1.5B Chinese image-text pairs, yielding significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
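The multi-view objective described above can be sketched in a few lines: a standard symmetric InfoNCE loss is averaged over every pair of views, so that both intra-modal pairs (image-image, text-text) and inter-modal pairs (image-text) contribute. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation; the function names, the temperature value, and the uniform averaging over view pairs are illustrative assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of L2-normalized
    embeddings; matched pairs sit on the diagonal of the logit matrix."""
    logits = a @ b.T / temperature
    idx = np.arange(len(a))

    def xent(l):
        # numerically stable log-softmax per row, then pick the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # average the a->b and b->a directions
    return 0.5 * (xent(logits) + xent(logits.T))

def multi_view_loss(image_views, text_views, temperature=0.1):
    """Average InfoNCE over every pair of views, covering both
    intra-modal (image-image, text-text) and inter-modal pairs.
    An object-tag sequence would enter here as one more textual view."""
    views = list(image_views) + list(text_views)
    losses = [info_nce(u, v, temperature)
              for i, u in enumerate(views)
              for v in views[i + 1:]]
    return sum(losses) / len(losses)
```

With two views per modality this averages over six pairwise losses: one image-image, one text-text, and four image-text terms, which is how the framework tightens intra-modal and inter-modal correlations at the same time.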