As multimodal learning finds applications in a wide variety of high-stakes societal tasks, investigating the robustness of these models becomes important. Existing work has focused on understanding the robustness of vision-and-language models to imperceptible variations on benchmark tasks. In this work, we investigate the robustness of multimodal classifiers to cross-modal dilutions, a plausible variation. We develop a model that, given a multimodal (image + text) input, generates additional dilution text that (a) maintains relevance and topical coherence with the image and existing text, and (b) when added to the original text, leads to misclassification of the multimodal input. Via experiments on Crisis Humanitarianism and Sentiment Detection tasks, we find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5%, respectively, in the presence of dilutions generated by our model. Metric-based comparisons with several baselines and human evaluations indicate that our dilutions show higher relevance and topical coherence, while simultaneously being more effective at demonstrating the brittleness of the multimodal classifiers. Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations, especially in human-facing societal applications. The code and other resources are available at https://claws-lab.github.io/multimodal-robustness/.