Since its appearance, Visual Question Answering (VQA, i.e. answering a question posed over an image) has always been treated as a classification problem over a set of predefined answers. Despite its convenience, this classification approach poorly reflects the semantics of the task: it reduces answering to a choice among independent proposals, without taking the similarity between them into account (e.g. penalizing the model equally for answering cat or German shepherd instead of dog). We address this issue by proposing (1) two measures of proximity between VQA classes, and (2) a corresponding loss that takes the estimated proximity into account. This significantly improves the generalization of VQA models by reducing their language bias. In particular, we show that our approach is completely model-agnostic, as it yields consistent improvements with three different VQA models. Finally, by combining our method with a language bias reduction approach, we report SOTA-level performance on the challenging VQA-CP v2 dataset.
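To make the idea of a proximity-aware loss concrete, here is a minimal sketch, not the authors' exact formulation: it assumes a precomputed pairwise proximity matrix between answer classes (e.g. cosine similarity of answer word embeddings) and replaces the usual one-hot cross-entropy target with a soft target distribution that spreads probability mass over semantically close answers. The function names, the temperature parameter, and the toy proximity values are all hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_targets(labels, proximity, temperature=0.1):
    """Turn hard class indices into proximity-weighted soft targets.

    labels:    (batch,) ground-truth answer indices
    proximity: (num_classes, num_classes) pairwise proximity in [0, 1]
    """
    # Row of the proximity matrix corresponding to each ground-truth answer.
    rows = proximity[labels]                      # (batch, num_classes)
    # Sharpen and normalize into a probability distribution.
    return F.softmax(rows / temperature, dim=-1)

def proximity_aware_loss(logits, labels, proximity, temperature=0.1):
    """Cross-entropy against proximity-weighted soft targets instead of one-hot labels."""
    targets = soft_targets(labels, proximity, temperature)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Toy usage: 4 answer classes {dog, cat, german shepherd, car}, with a
# proximity matrix estimated offline (values here are illustrative only).
proximity = torch.tensor([
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.6, 0.1],
    [0.7, 0.6, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
])
logits = torch.randn(2, 4)
labels = torch.tensor([0, 3])
loss = proximity_aware_loss(logits, labels, proximity)
```

Under this kind of target, answering cat when the ground truth is dog still incurs a loss, but a smaller one than answering car, which is the behaviour the proximity measures are meant to encode.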