Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress-test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model and, for each image in the dataset, attempt to find a question for which the model's predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art.