Recent VQA models tend to rely on language bias as a shortcut and thus fail to sufficiently learn multi-modal knowledge from both vision and language. In this paper, we investigate how to capture and mitigate language bias in VQA. Motivated by causal effects, we propose a novel counterfactual inference framework, which enables us to capture language bias as the direct causal effect of questions on answers and to reduce this bias by subtracting the direct language effect from the total causal effect. Experiments demonstrate that our proposed counterfactual inference framework 1) generalizes to various VQA backbones and fusion strategies, and 2) achieves competitive performance on the language-bias-sensitive VQA-CP dataset while performing robustly on the balanced VQA v2 dataset.
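As a minimal sketch of the subtraction described above, written in the standard notation of causal mediation analysis (the symbols $q$, $v$, $q^{*}$, $v^{*}$, and $Y$ are illustrative and not necessarily the paper's own notation): let $Y_{q,v}$ denote the answer score under question $q$ and visual input $v$, and let $q^{*}$, $v^{*}$ be reference (counterfactual) inputs.
\begin{align}
  \mathrm{TE}  &= Y_{q,v} - Y_{q^{*},v^{*}} && \text{total causal effect of } (q,v) \text{ on the answer} \\
  \mathrm{NDE} &= Y_{q,v^{*}} - Y_{q^{*},v^{*}} && \text{direct (language-only) effect of the question} \\
  \mathrm{TIE} &= \mathrm{TE} - \mathrm{NDE} = Y_{q,v} - Y_{q,v^{*}} && \text{debiased effect used for answer inference}
\end{align}
Under these assumptions, inference amounts to ranking answers by the total indirect effect $\mathrm{TIE}$ rather than by $Y_{q,v}$ alone, so the question-only shortcut captured by $\mathrm{NDE}$ is removed.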