Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve a competitive result of $44.2$ mAP on AudioSet 20K.