A number of studies have pointed out that current Visual Question Answering (VQA) models are severely affected by the language prior problem, i.e., blindly making predictions based on superficial language shortcuts. Some efforts have been devoted to overcoming this issue with carefully designed models. However, no prior work addresses it from the angle of answer feature space learning, despite the fact that existing VQA methods all cast VQA as a classification task. Inspired by this observation, we attempt to tackle the language prior problem from the viewpoint of feature space learning. To this end, we design an adapted margin cosine loss that properly discriminates between the feature spaces of frequent and sparse answers under each question type. As a result, the limited patterns within the language modality are largely suppressed, and thus fewer language priors are introduced by our method. We apply this loss function to several baseline models and evaluate its effectiveness on two VQA-CP benchmarks. Experimental results demonstrate that our adapted margin cosine loss greatly enhances the baseline models, with an absolute performance gain of 15\% on average, strongly verifying the potential of tackling the language prior problem in VQA from the angle of answer feature space learning.
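To make the core idea concrete, the sketch below shows a standard large-margin cosine loss (CosFace-style) in NumPy: answer logits are cosine similarities between L2-normalized features and class weights, and a margin is subtracted from the ground-truth class before the scaled softmax cross-entropy. This is only a minimal illustration of the loss family the abstract builds on; the paper's *adapted* version additionally conditions the margin on answer frequency per question type, which is not reproduced here, and all names and hyperparameters (`s`, `m`) are illustrative assumptions.

```python
import numpy as np

def margin_cosine_loss(features, class_weights, labels, s=30.0, m=0.35):
    """CosFace-style large margin cosine loss (illustrative sketch).

    features:      (batch, dim) answer/fusion features
    class_weights: (num_answers, dim) classifier weight vectors
    labels:        (batch,) integer ground-truth answer indices
    s: scale applied to cosine logits; m: margin subtracted from the target class
    """
    # L2-normalize so the logits are pure cosine similarities
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = f @ w.T                                   # (batch, num_answers)

    # Subtract the margin only from the ground-truth class logit
    rows = np.arange(len(labels))
    cos[rows, labels] -= m

    # Numerically stable scaled softmax cross-entropy
    logits = s * cos
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()
```

With `m > 0`, the target class must beat the other answers by a cosine margin, which pushes frequent and sparse answer features apart instead of letting frequency priors dominate the decision boundary.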