This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the past few years. We group these approaches into three categories: ($i$) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; ($ii$) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and ($iii$) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress made and the challenges that remain, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild.