Analysis of vision-and-language models has revealed their brittleness under linguistic phenomena such as paraphrasing, negation, textual entailment, and word substitutions with synonyms or antonyms. While data augmentation techniques have been designed to mitigate these failure modes, methods that integrate this knowledge directly into the training pipeline remain under-explored. In this paper, we present \textbf{SDRO}, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting, along with an ensembling technique to leverage these transformations during inference. Experiments on benchmark datasets with images (NLVR$^2$) and video (VIOLIN) demonstrate performance improvements as well as robustness to adversarial attacks. Experiments on binary VQA further explore the generalizability of this method to other V\&L tasks.