Most existing Visual Question Answering (VQA) systems tend to rely heavily on language priors and hence fail to reason from visual clues. To address this issue, we propose a novel Language-Prior Feedback (LPF) objective function that re-balances the proportion of each answer's loss in the total VQA loss. The LPF first computes a modulating factor with a question-only branch to estimate the language bias of each sample. It then assigns a self-adaptive weight to each training sample during training. Through this re-weighting mechanism, the LPF reshapes the total VQA loss into a more balanced form, so that samples whose answers require visual information are used more effectively during training. Our method is simple to implement, model-agnostic, and end-to-end trainable. We conduct extensive experiments, and the results show that the LPF (1) brings a significant improvement over various VQA models, and (2) achieves competitive performance on the bias-sensitive VQA-CP v2 benchmark.
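To make the re-weighting mechanism concrete, below is a minimal PyTorch-style sketch of an LPF-like loss. The specific weighting form (1 - bias)^gamma, the function name lpf_weighted_vqa_loss, and the hyperparameter gamma are illustrative assumptions, not the paper's exact formulation; the intent is only to show how a question-only branch can down-weight language-biased samples in the total loss.

```python
import torch
import torch.nn.functional as F

def lpf_weighted_vqa_loss(vqa_logits, q_only_logits, answer_targets, gamma=2.0):
    """Sketch of an LPF-style re-weighted VQA loss (illustrative, not the paper's exact form).

    vqa_logits:     [B, A] scores from the full VQA model
    q_only_logits:  [B, A] scores from the question-only branch
    answer_targets: [B, A] soft target scores (standard VQA supervision)
    """
    # Per-sample VQA loss (binary cross-entropy over answers is common in VQA)
    per_sample_loss = F.binary_cross_entropy_with_logits(
        vqa_logits, answer_targets, reduction="none").sum(dim=1)

    with torch.no_grad():
        # Modulating factor: confidence of the question-only branch on the
        # ground-truth answers, used as a proxy for language bias.
        q_probs = torch.softmax(q_only_logits, dim=1)
        bias = (q_probs * answer_targets).sum(dim=1).clamp(0.0, 1.0)
        # Self-adaptive weight (assumed form): strongly biased samples get
        # smaller weights, so visually grounded samples dominate the loss.
        weights = (1.0 - bias) ** gamma

    return (weights * per_sample_loss).mean()
```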