Many real-world problems are inherently multimodal, from the communicative modalities humans use to express social and emotional states to the force, proprioception, and visual sensors ubiquitous on robots. While there has been an explosion of interest in multimodal representation learning, these methods are still largely focused on a small set of modalities, primarily in the language, vision, and audio space. To accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios. Since adding new models for every new modality or task becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? We propose two new information-theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar two modalities $\{X_1,X_2\}$ are by measuring how much information can be transferred from $X_1$ to $X_2$, while (2) interaction heterogeneity studies how similarly two pairs of modalities $\{X_1,X_2\}, \{X_3,X_4\}$ interact by measuring how much interaction information can be transferred from $\{X_1,X_2\}$ to $\{X_3,X_4\}$. We show the importance of these proposed metrics in high-modality scenarios as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to $10$ modalities and $15$ tasks from $5$ different research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks during fine-tuning.
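The transfer-based heterogeneity measure described above can be illustrated with a minimal sketch. The assumption here (not a claim about the paper's exact implementation) is that modality heterogeneity is approximated by the gap between the loss after transferring a model from a source modality and the loss of training from scratch on the target: when transfer is much better than scratch, the two modalities plausibly share information and can share parameters. All numbers and modality names below are illustrative.

```python
import numpy as np

# Hypothetical transfer losses (illustrative, not from the paper):
# entry [i, j] is the loss on target modality j after fine-tuning a
# model pre-trained on source modality i.
transfer_loss = np.array([
    [0.10, 0.15, 0.60],  # from "text"
    [0.14, 0.10, 0.55],  # from "audio"
    [0.58, 0.52, 0.10],  # from "force sensing"
])
# Loss when each target modality is trained from scratch instead.
scratch_loss = np.array([0.30, 0.28, 0.35])

# Heterogeneity proxy: how much harder transfer is than training from
# scratch. Negative values mean transfer helps, i.e. low heterogeneity.
heterogeneity = transfer_loss - scratch_loss[None, :]

# Symmetrize, then share parameters between modality pairs whose mutual
# heterogeneity falls below a threshold (0.0 here: transfer must beat
# training from scratch in both directions on average).
sym = (heterogeneity + heterogeneity.T) / 2
share = sym < 0.0
print(share)
```

Under these toy numbers, "text" and "audio" end up sharing parameters while "force sensing" is kept separate, matching the intuition that the fusion of modalities with unique information should be prioritized with their own capacity.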