Since its inception, Visual Question Answering (VQA) has been known as a task where models are prone to exploiting dataset biases to find shortcuts instead of performing high-level reasoning. Classical methods address this by removing biases from the training data or by adding branches to models to detect and remove biases. In this paper, we argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision-and-language problems. We train a visual oracle and, in a large-scale study, provide experimental evidence that it is much less prone to exploiting spurious dataset biases than standard models. We propose to study the attention mechanisms at work in the visual oracle and to compare them with those of a SOTA Transformer-based model. We provide an in-depth analysis and visualizations of reasoning patterns obtained with an online visualization tool, which we make publicly available (https://reasoningpatterns.github.io). We exploit these insights by transferring reasoning patterns from the oracle, via fine-tuning, to a SOTA Transformer-based VQA model taking standard noisy visual inputs. In experiments we report higher overall accuracy, as well as higher accuracy on infrequent answers for each question type, which provides evidence of improved generalization and a decreased dependency on dataset biases.
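To make the two-stage transfer concrete, the following is a minimal sketch, not the paper's actual architecture: a toy Transformer-based VQA model is first trained as an "oracle" on ground-truth object embeddings, then the same weights are fine-tuned on standard noisy detector features. The class `VQAModel`, the helper `train_epoch`, the data loaders, and all hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of oracle-to-noisy transfer via fine-tuning.
# All names, dimensions, and hyperparameters below are assumptions.
import torch
import torch.nn as nn


class VQAModel(nn.Module):
    """Toy stand-in for a Transformer-based VQA model."""

    def __init__(self, visual_dim=2048, hidden=512, vocab=3000, answers=1500):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.question_emb = nn.Embedding(vocab, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(hidden, answers)

    def forward(self, visual_feats, question_tokens):
        # Project visual tokens, embed question tokens, and let
        # self-attention over the concatenated sequence fuse modalities.
        v = self.visual_proj(visual_feats)
        q = self.question_emb(question_tokens)
        fused = self.encoder(torch.cat([v, q], dim=1))
        return self.classifier(fused.mean(dim=1))


def train_epoch(model, loader, optimizer, loss_fn):
    # One pass over (visual_feats, question_tokens, answer_label) batches.
    for visual, question, answer in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(visual, question), answer)
        loss.backward()
        optimizer.step()


model = VQAModel()
loss_fn = nn.CrossEntropyLoss()

# Stage 1: train the visual oracle on ground-truth object embeddings,
# so attention can learn reasoning patterns free of visual noise.
oracle_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# train_epoch(model, oracle_loader, oracle_opt, loss_fn)  # hypothetical loader

# Stage 2: fine-tune the same weights on noisy detector features,
# transferring the oracle's reasoning patterns to standard inputs.
finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
# train_epoch(model, noisy_loader, finetune_opt, loss_fn)  # hypothetical loader
```

The key design point illustrated here is that both stages share one set of weights; only the visual input distribution changes between the oracle pretraining and the fine-tuning on noisy features.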