Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the types of modalities available at training and deployment differ, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework that provides a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, when present, to achieve a competitive result of $44.2$ mAP on AudioSet 20K.