Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of such models against various real-world perturbations, focusing on video and language. We concentrate on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different textual perturbations. The study reveals some interesting findings: 1) The studied models are more robust when text is perturbed than when video is perturbed. 2) The transformer text encoder is more robust to non-semantic-changing text perturbations and visual perturbations than word-embedding approaches. 3) Using two-branch encoders in isolation is typically more robust than architectures that use cross-attention. We hope this study will serve as a benchmark and guide future research in robust multimodal learning.