Benefiting from large-scale pre-training, we have witnessed a significant performance boost on the popular Visual Question Answering (VQA) task. Despite rapid progress, it remains unclear whether these state-of-the-art (SOTA) models are robust when encountering examples in the wild. To study this, we introduce Adversarial VQA, a new large-scale VQA benchmark, collected iteratively via an adversarial human-and-model-in-the-loop procedure. Through this new benchmark, we discover several interesting findings. (i) Surprisingly, we find that during dataset collection, non-expert annotators can easily and successfully attack SOTA VQA models. (ii) Both large-scale pre-trained models and adversarial training methods achieve far worse performance on the new benchmark than on the standard VQA v2 dataset, revealing the fragility of these models while demonstrating the effectiveness of our adversarial dataset. (iii) When used for data augmentation, our dataset can effectively boost model performance on other robust VQA benchmarks. We hope our Adversarial VQA dataset can shed new light on robustness studies in the community and serve as a valuable benchmark for future work.
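To make the human-and-model-in-the-loop collection procedure concrete, below is a minimal Python sketch of one plausible form of the loop: an annotator repeatedly writes questions for an image until the target model's answer disagrees with the human ground truth, and the fooling example is kept. All names here (TargetVQAModel, get_annotator_question, get_annotator_answer) are hypothetical stand-ins, not the authors' actual implementation or interface.

```python
# Hypothetical sketch of adversarial human-and-model-in-the-loop collection.
# Stubs simulate the human annotator and the target model; in practice both
# would be a live annotation UI and a real SOTA VQA model.

from dataclasses import dataclass
from typing import List


@dataclass
class AdversarialExample:
    image_id: str
    question: str
    model_answer: str
    human_answer: str


class TargetVQAModel:
    """Stand-in for a SOTA VQA model under attack (hypothetical)."""

    def predict(self, image_id: str, question: str) -> str:
        return "yes"  # placeholder prediction


def get_annotator_question(image_id: str) -> str:
    return "What is behind the tree?"  # stub for human annotator input


def get_annotator_answer(image_id: str, question: str) -> str:
    return "a bicycle"  # stub for human ground-truth answer


def collect_adversarial_vqa(model: TargetVQAModel,
                            image_ids: List[str],
                            max_tries: int = 5) -> List[AdversarialExample]:
    """For each image, the annotator keeps writing questions until the
    model's answer disagrees with the human answer; that fooling
    (question, answer) pair is stored as an adversarial example."""
    dataset: List[AdversarialExample] = []
    for image_id in image_ids:
        for _ in range(max_tries):
            question = get_annotator_question(image_id)       # human in the loop
            human_answer = get_annotator_answer(image_id, question)
            model_answer = model.predict(image_id, question)
            if model_answer != human_answer:                  # model is fooled
                dataset.append(AdversarialExample(
                    image_id, question, model_answer, human_answer))
                break
    return dataset


if __name__ == "__main__":
    examples = collect_adversarial_vqa(TargetVQAModel(), ["img_0001"])
    print(examples)
```

In an iterative setup, the model would periodically be retrained on the examples collected so far and redeployed in the loop, so later annotation rounds attack a progressively stronger target.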