With large-scale pre-training, the past two years have witnessed a significant performance boost on the Visual Question Answering (VQA) task. Though rapid progress has been made, it remains unclear whether these state-of-the-art (SOTA) VQA models are robust when encountering test examples in the wild. To study this, we introduce Adversarial VQA, a new large-scale VQA benchmark, collected iteratively via an adversarial human-and-model-in-the-loop procedure. Through this new benchmark, we present several interesting findings. (i) Surprisingly, during dataset collection, we find that non-expert annotators can successfully attack SOTA VQA models with relative ease. (ii) We test a variety of SOTA VQA models on our new dataset to highlight their fragility, and find that both large-scale pre-trained models and adversarial training methods achieve far lower performance than they do on the standard VQA v2 dataset. (iii) When used for data augmentation, our dataset improves performance on other robust VQA benchmarks. (iv) We present a detailed analysis of the dataset, providing valuable insights into the challenges it poses to the community. We hope Adversarial VQA can serve as a valuable benchmark for future work to test the robustness of their VQA models. Our dataset is publicly available at https://adversarialvqa.github.io/.