Despite considerable recent progress in Visual Question Answering (VQA) models, inconsistent or contradictory answers continue to cast doubt on their true reasoning capabilities. However, most proposed methods use indirect strategies or strong assumptions on pairs of questions and answers to enforce model consistency. Instead, we propose a novel strategy intended to improve model performance by directly reducing logical inconsistencies. To do this, we introduce a new consistency loss term that can be used by a wide range of VQA models and that relies on knowing the logical relation between pairs of questions and answers. While such information is typically not available in VQA datasets, we propose to infer these logical relations using a dedicated language model and to use them in our proposed consistency loss function. We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method improves state-of-the-art VQA models while remaining robust across different architectures and settings.
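To make the idea concrete, below is a minimal sketch of what an implication-based consistency penalty could look like in PyTorch. The function name, the product-based penalty form, and the weight lambda_cons are illustrative assumptions for a pair (Q1, A1) -> (Q2, A2), not the paper's exact formulation: the term grows when the model is confident in the antecedent answer but not in the consequent, which is precisely a logical inconsistency.

```python
import torch

def implication_consistency_loss(p_ante: torch.Tensor,
                                 p_cons: torch.Tensor) -> torch.Tensor:
    # p_ante: probability the model assigns to the ground-truth answer of the
    #         antecedent question Q1 in an implication pair (Q1, A1) -> (Q2, A2).
    # p_cons: probability assigned to the ground-truth answer of the consequent Q2.
    # Hypothetical penalty: large when the model trusts the antecedent but not
    # the consequent; near zero when the implication is respected.
    return (p_ante * (1.0 - p_cons)).mean()

# Illustrative usage: combine with the usual VQA cross-entropy objective.
logits_q1 = torch.randn(8, 10)          # batch of antecedent-question logits
logits_q2 = torch.randn(8, 10)          # batch of consequent-question logits
labels_q1 = torch.randint(0, 10, (8,))  # ground-truth answer indices
labels_q2 = torch.randint(0, 10, (8,))

p1 = torch.softmax(logits_q1, dim=1).gather(1, labels_q1.unsqueeze(1)).squeeze(1)
p2 = torch.softmax(logits_q2, dim=1).gather(1, labels_q2.unsqueeze(1)).squeeze(1)

ce = torch.nn.functional.cross_entropy(logits_q1, labels_q1) \
   + torch.nn.functional.cross_entropy(logits_q2, labels_q2)
lambda_cons = 0.5                       # assumed weighting hyperparameter
total_loss = ce + lambda_cons * implication_consistency_loss(p1, p2)
```

In this sketch the implication labels for question pairs would come from the dedicated language model mentioned above; pairs with no inferred logical relation would simply be excluded from the penalty.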