Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world -- they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong. These sub-questions pertain to lower-level visual concepts in the image that models should ideally understand in order to answer the higher-level question correctly. To address this, we first present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image, and use this to evaluate VQA models on their ability to identify the relevant sub-questions needed to answer a reasoning question. Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT), which encourages models to rank relevant sub-questions higher than irrelevant questions for an <image, reasoning-question> pair. We show that SOrT improves model consistency by up to 6.5 percentage points over existing baselines, while also improving visual grounding.