Pretrained models have achieved great success in both Computer Vision (CV) and Natural Language Processing (NLP). This progress has led to Visual-Language Pretrained Models (VLPMs), which learn joint representations of vision and language by feeding visual and linguistic content into a multi-layer transformer. In this paper, we present an overview of the major advances achieved in VLPMs for producing joint representations of vision and language. As preliminaries, we briefly describe the general task definition and the generic architecture of VLPMs. We first discuss the language and vision data encoding methods and then present the mainstream VLPM structure as the core content. We further summarise several essential pretraining and fine-tuning strategies. Finally, we highlight three future directions to provide insightful guidance for both CV and NLP researchers.