The Transformer is a promising neural network learner and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and big data, Transformer-based multimodal learning has become a hot topic in AI research. This paper presents a comprehensive survey of Transformer techniques oriented toward multimodal data. The main contents of this survey include: (1) background on multimodal learning, the Transformer ecosystem, and the multimodal big data era; (2) a theoretical review of the Vanilla Transformer, the Vision Transformer, and multimodal Transformers from a geometrically topological perspective; (3) a review of multimodal Transformer applications via two important paradigms, i.e., multimodal pretraining and specific multimodal tasks; (4) a summary of the common challenges and designs shared by multimodal Transformer models and applications; and (5) a discussion of open problems and potential research directions for the community.