Visual Question Answering (VQA) models tend to rely on language bias and thus fail to learn reasoning from visual knowledge, which is the original intention of VQA. In this paper, we propose a novel cause-effect look at language bias, in which the bias is formulated, from the view of causal inference, as the direct causal effect of the question on the answer. This effect can be captured by counterfactual VQA, which imagines a scenario in which the image had not existed. Our proposed cause-effect look 1) is general to any baseline VQA architecture, 2) achieves significant improvement on the language-bias-sensitive VQA-CP dataset, and 3) fills the theoretical gap in recent language-prior-based works.
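The debiasing idea in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's exact formulation): a question-only branch estimates the direct language effect, i.e. the prediction "had the image not existed", and that effect is subtracted from the full fused prediction so that what remains is the effect that actually uses the image. The function and variable names here are illustrative assumptions.

```python
def counterfactual_debias(fused_logits, question_only_logits):
    # Hypothetical sketch: subtract the direct question->answer effect
    # (estimated by a question-only branch, the counterfactual in which
    # the image had not existed) from the full (question, image) logits.
    return [f - q for f, q in zip(fused_logits, question_only_logits)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Toy example where the language prior alone strongly favors answer 0.
fused = [2.0, 1.5, 0.5]    # logits from the full (question, image) model
q_only = [1.8, 0.2, 0.1]   # logits from the question-only branch
debiased = counterfactual_debias(fused, q_only)
print(argmax(fused))     # biased prediction: 0
print(argmax(debiased))  # after removing the language effect: 1
```

The sketch shows why the approach is architecture-agnostic: it only manipulates the two branches' output logits, not the internals of the baseline VQA model.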