This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and the recent trend of shifting from single-modality processing to multi-modality comprehension. We summarize the development of this field into three time periods: task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce the development of task-specific methods. Then we focus on VLP methods and comprehensively review the key components of their model structures and training methods. After that, we show how recent work utilizes large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero-shot or few-shot learning tasks. Finally, we discuss some potential future trends toward modality cooperation, unified representation, and knowledge incorporation. We believe that this review will be helpful for researchers and practitioners of AI and ML, especially those interested in computer vision and natural language processing.