With the burgeoning volume of image-text pair data and the growing diversity of Vision-and-Language (V&L) tasks, scholars have introduced an abundance of deep learning models in this research domain. Furthermore, in recent years, transfer learning has shown tremendous success in Computer Vision for tasks such as Image Classification and Object Detection, and in Natural Language Processing for tasks such as Question Answering and Machine Translation. Inheriting the spirit of transfer learning, research in V&L has devised multiple pretraining techniques on large-scale datasets in order to enhance performance on downstream tasks. The aim of this article is to provide a comprehensive review of contemporary V&L pretraining models. In particular, we categorize and delineate pretraining approaches, along with a summary of state-of-the-art vision-and-language pre-trained models. Moreover, we supply a list of training datasets and downstream tasks to further sharpen the perspective on V&L pretraining. Lastly, we take a further step and discuss numerous directions for future research.