Recent years have seen rapid developments in multimodal machine learning, which combines, e.g., vision, text, or speech. In this position paper, we explain how the field relies on outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on the representations and information relevant to a given machine learning task. With this new definition, we aim to provide a missing foundation for multimodal research, an important component of language grounding and a crucial milestone towards natural language understanding (NLU).