Vision-language pre-training (VLP) on large-scale image-text pairs has achieved great success on cross-modal downstream tasks. Most existing pre-training methods adopt a two-step procedure: a pre-trained object detector first extracts region-based visual features, and the image representations are then concatenated with text embeddings as the input to a Transformer for training. However, these methods suffer from two problems: the task-specific visual representations produced by a particular object detector are ill-suited to generic cross-modal understanding, and the two-stage pipeline is computationally inefficient. In this paper, we propose E2E-VLP, the first end-to-end vision-language pre-trained model for both V+L understanding and generation, which builds a unified Transformer framework to jointly learn visual representations and the semantic alignments between image and text. We incorporate object detection and image captioning into pre-training through a unified Transformer encoder-decoder architecture to enhance visual learning. Extensive experiments on well-established vision-language downstream tasks demonstrate the effectiveness of this novel VLP paradigm.
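To make the described architecture concrete, the following is a minimal PyTorch-style sketch of an end-to-end encoder-decoder of this kind: pixel-level visual tokens and text embeddings are encoded jointly, and a decoder drives both a detection head and a captioning head. The module names, dimensions, simple patch-projection backbone, and shared decoder are illustrative assumptions for this sketch, not the paper's actual implementation.

import torch
import torch.nn as nn

class E2EVLPSketch(nn.Module):
    """Hypothetical sketch: a pixel-level backbone and text embeddings feed a
    shared Transformer encoder; a Transformer decoder supports a DETR-style
    detection head and an image-captioning head during pre-training."""

    def __init__(self, vocab_size=30522, d_model=256, num_classes=91, num_queries=100):
        super().__init__()
        # Crude patch projection standing in for a CNN backbone (no object detector).
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        # Learned object queries for the detection pre-training task.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.class_head = nn.Linear(d_model, num_classes + 1)  # detection: class logits
        self.box_head = nn.Linear(d_model, 4)                  # detection: normalized boxes
        self.caption_head = nn.Linear(d_model, vocab_size)     # captioning: token logits

    def forward(self, images, text_ids, caption_ids):
        # Pixel features -> sequence of visual tokens: (B, HW, d_model).
        vis = self.backbone(images).flatten(2).transpose(1, 2)
        txt = self.text_embed(text_ids)                        # (B, T, d_model)
        # Joint encoding of visual and textual tokens for cross-modal alignment.
        memory = self.encoder(torch.cat([vis, txt], dim=1))
        # Detection branch: object queries attend to the joint memory.
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        det = self.decoder(q, memory)
        # Captioning branch: causally masked caption tokens attend to the same memory.
        tgt_len = caption_ids.size(1)
        causal = torch.triu(
            torch.full((tgt_len, tgt_len), float("-inf"), device=caption_ids.device),
            diagonal=1)
        cap = self.decoder(self.text_embed(caption_ids), memory, tgt_mask=causal)
        return self.class_head(det), self.box_head(det).sigmoid(), self.caption_head(cap)

In this sketch the detection and captioning losses (e.g. Hungarian matching for boxes, cross-entropy for caption tokens) would be applied on the three outputs during pre-training; those training details are omitted here.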