Since 2010, deep learning has revolutionized speech recognition, image recognition, and natural language processing, each of which involves a single modality in the input signal. Many applications in artificial intelligence, however, involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. This paper provides a technical review of the models and learning methods for multimodal intelligence. The main focus is the combination of vision and natural language, which has become an important area in both the computer vision and natural language processing research communities. The review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles: learning multimodal representations, the fusion of multimodal signals at various levels, and multimodal applications. On multimodal representation learning, we review the key concept of embedding, which unifies multimodal signals into the same vector space and thus enables cross-modality signal processing. We also review the properties of the many types of embeddings constructed and learned for general downstream tasks. On multimodal fusion, the review focuses on special architectures that integrate the representations of unimodal signals for a particular task. On applications, we cover selected areas of broad interest in the current literature, including caption generation, text-to-image generation, and visual question answering. We believe this review can facilitate future studies in the emerging field of multimodal intelligence for the community.