As transformer architectures have evolved, pre-trained models have advanced at a breakneck pace in recent years, coming to dominate mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to the field of Vision-and-Language (V-L) learning and improve downstream task performance has become a focus of multimodal learning. In this paper, we review the recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways of encoding raw images and texts into single-modal embeddings before pre-training. We then dive into the mainstream architectures of VL-PTMs for modeling the interaction between text and image representations. We further present widely used pre-training tasks, followed by common downstream tasks. We finally conclude the paper and outline some promising research directions. Our survey aims to provide researchers with a synthesis of, and pointers to, related research.