Multimodal machine learning is a vibrant multidisciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning by integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining two key principles, modality heterogeneity and interconnections, that have driven subsequent innovations, and propose a taxonomy of six core technical challenges covering historical and recent trends: representation, alignment, reasoning, generation, transference, and quantification. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.