We introduce an evaluation methodology for visual question answering (VQA) to better diagnose cases of shortcut learning. These cases happen when a model exploits spurious statistical regularities to produce correct answers but does not actually deploy the desired behavior. There is a need to identify possible shortcuts in a dataset and assess their use before deploying a model in the real world. The research community in VQA has focused exclusively on question-based shortcuts, where a model might, for example, answer "What is the color of the sky?" with "blue" by relying mostly on the question-conditional training prior while giving little weight to visual evidence. We go a step further and consider multimodal shortcuts that involve both questions and images. We first identify potential shortcuts in the popular VQA v2 training set by mining trivial predictive rules, such as co-occurrences of words and visual elements. We then create VQA-CE, a new evaluation set made of CounterExamples, i.e., questions where the mined rules lead to incorrect answers. We use this new evaluation in a large-scale study of existing models. We demonstrate that even state-of-the-art models perform poorly and that existing techniques to reduce biases are largely ineffective in this context. Our findings suggest that past work on question-based biases in VQA has only addressed one facet of a complex issue. The code for our method is available at https://github.com/cdancette/detect-shortcuts
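To make the rule-mining and counterexample-selection steps concrete, here is a minimal Python sketch of the general idea: count co-occurrences between question words and detected visual elements, keep (word, object) → answer rules with high support and confidence, and collect evaluation questions on which every matching rule predicts a wrong answer. The data schema (`question_words`, `objects`, `answer`), the thresholds, and the simple counting approach are illustrative assumptions; the actual pipeline in the linked repository uses a full association-rule miner and is more involved.

```python
from collections import Counter, defaultdict

def mine_rules(examples, min_support=50, min_confidence=0.9):
    """Mine trivial (question word, visual object) -> answer rules.

    `examples` is assumed to be an iterable of dicts with keys
    'question_words', 'objects', and 'answer'; this schema is a
    hypothetical stand-in for the paper's actual data format.
    """
    support = Counter()                   # occurrences of each (word, object) pattern
    answer_counts = defaultdict(Counter)  # answers observed under each pattern

    for ex in examples:
        for w in set(ex["question_words"]):
            for o in set(ex["objects"]):
                pattern = (w, o)
                support[pattern] += 1
                answer_counts[pattern][ex["answer"]] += 1

    rules = {}
    for pattern, count in support.items():
        answer, hits = answer_counts[pattern].most_common(1)[0]
        confidence = hits / count
        # Keep only frequent, highly predictive rules (shortcut candidates).
        if count >= min_support and confidence >= min_confidence:
            rules[pattern] = answer
    return rules

def find_counterexamples(examples, rules):
    """Return questions where every matching rule predicts a wrong answer."""
    counterexamples = []
    for ex in examples:
        predicted = [rules[(w, o)]
                     for w in set(ex["question_words"])
                     for o in set(ex["objects"])
                     if (w, o) in rules]
        if predicted and ex["answer"] not in predicted:
            counterexamples.append(ex)
    return counterexamples
```

In this sketch, rules would be mined on the training set and counterexamples selected from the validation set, so that a model scoring well on the resulting subset cannot be relying on the mined shortcuts.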