Despite the great progress in Visual Question Answering (VQA), current VQA models heavily rely on the superficial correlation between the question type and its corresponding frequent answers (i.e., language priors) to make predictions, without really understanding the input. In this work, we define the training instances with the same question type but different answers as \textit{superficially similar instances}, and attribute the language priors to the VQA model's confusion on such instances. To solve this problem, we propose a novel training framework that explicitly encourages the VQA model to distinguish between superficially similar instances. Specifically, for each training instance, we first construct a set that contains its superficially similar counterparts. Then we apply the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space. In this way, the VQA model is forced to attend to the parts of the input beyond the question type, which helps it overcome the language priors. Experimental results show that our method achieves state-of-the-art performance on VQA-CP v2. Code is available at \href{https://github.com/wyk-nku/Distinguishing-VQA.git}{Distinguishing-VQA}.
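To make the idea concrete, below is a minimal sketch of one way the distinguishing module's objective could be implemented in PyTorch. The function name \texttt{distinguishing\_loss} and the margin hyperparameter are illustrative assumptions, not the authors' exact formulation; it simply realizes the stated goal of pushing an instance's predicted answer distribution away from those of its superficially similar counterparts.
\begin{verbatim}
import torch
import torch.nn.functional as F

def distinguishing_loss(logits, counterpart_logits, margin=1.0):
    """Hinge-style distance loss (illustrative sketch).

    logits:             (A,)   model prediction for the instance
    counterpart_logits: (k, A) predictions for its k superficially
                               similar counterparts (same question
                               type, different answers)
    margin:             assumed hyperparameter, not from the paper
    """
    p = F.softmax(logits, dim=-1)                  # (A,)
    q = F.softmax(counterpart_logits, dim=-1)      # (k, A)
    # L2 distance in the answer space between the instance
    # and each of its counterparts
    dist = torch.norm(p.unsqueeze(0) - q, dim=-1)  # (k,)
    # Penalize counterparts whose answer distributions lie
    # closer than the margin, forcing them apart
    return F.relu(margin - dist).mean()
\end{verbatim}
Under this sketch, the loss would be added to the standard VQA training objective, so that minimizing it enlarges the answer-space distance between superficially similar instances.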