Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design three pre-training tasks: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Region Modeling (MRM, with three variants). Different from concurrent work on multimodal pre-training that applies joint random masking to both modalities, we use conditioned masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). Comprehensive analysis shows that conditioned masking yields better performance than unconditioned masking. We also conduct a thorough ablation study to find an optimal setting for the combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2.
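To make the conditioned-masking idea concrete, the following is a minimal sketch (in PyTorch) of how a single pre-training step might mask one modality while keeping the other fully observed, in contrast to joint random masking of both modalities at once. All tensor shapes, the 15% masking rate, and the helper name `conditional_mask` are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of conditioned masking as described in the abstract.
# Tensor shapes, the 15% mask rate, and all names here are assumptions.
import torch

MASK_PROB = 0.15  # assumed masking rate, analogous to BERT-style MLM


def conditional_mask(text_ids, region_feats, mask_text: bool):
    """Mask only one modality; the other stays fully observed.

    mask_text=True  -> MLM step: mask words, keep every image region.
    mask_text=False -> MRM step: mask regions, keep every word.
    """
    text_mask = torch.zeros_like(text_ids, dtype=torch.bool)
    region_mask = torch.zeros(region_feats.shape[:2], dtype=torch.bool)
    if mask_text:
        text_mask = torch.rand(text_ids.shape) < MASK_PROB
    else:
        region_mask = torch.rand(region_feats.shape[:2]) < MASK_PROB
    return text_mask, region_mask


# Joint random masking (the contrasted strategy) would instead draw both
# text_mask and region_mask independently in the same step, so a masked
# word may have to be predicted from a partially masked image, and vice versa.
text_ids = torch.randint(0, 30522, (2, 12))   # toy token ids
region_feats = torch.randn(2, 36, 2048)       # toy region features
t_mask, r_mask = conditional_mask(text_ids, region_feats, mask_text=True)
assert not r_mask.any()  # image fully observed during the MLM step
```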