While progress has been made on visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present \textit{MUTANT}, a training paradigm that exposes the model to perceptually similar, yet semantically distinct \textit{mutations} of the input, to improve OOD generalization on benchmarks such as the VQA-CP challenge. Under this paradigm, models utilize a consistency-constrained training objective to understand the effect of semantic changes in the input (question-image pair) on the output (answer). Unlike existing methods on VQA-CP, \textit{MUTANT} does not rely on knowledge of the nature of the train and test answer distributions. \textit{MUTANT} establishes a new state-of-the-art accuracy on VQA-CP with a $10.57\%$ improvement. Our work opens up avenues for the use of semantic input mutations for OOD generalization in question answering.
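The abstract does not specify the exact form of the consistency-constrained objective. Purely as an illustration, the sketch below shows one plausible pairwise consistency term in PyTorch: it pulls the model's answer distributions for an original sample and its mutant together when the mutation leaves the ground-truth answer unchanged, and pushes them apart when the answer changes. All names (\texttt{consistency\_loss}, \texttt{logits\_orig}, \texttt{logits\_mut}) are hypothetical and do not come from the paper.

\begin{verbatim}
import torch
import torch.nn.functional as F

def consistency_loss(logits_orig, logits_mut, answer_orig, answer_mut):
    # Hypothetical pairwise consistency term (not the paper's objective):
    # predictive distributions should stay close when the mutation keeps
    # the answer, and move apart when the mutation changes the answer.
    p_orig = F.softmax(logits_orig, dim=-1)
    p_mut = F.softmax(logits_mut, dim=-1)
    same_answer = (answer_orig == answer_mut).float()
    # total-variation distance between the two answer distributions, in [0, 1]
    dist = 0.5 * (p_orig - p_mut).abs().sum(dim=-1)
    # penalize distance for answer-preserving mutations,
    # penalize closeness for answer-changing mutations
    return (same_answer * dist + (1.0 - same_answer) * (1.0 - dist)).mean()
\end{verbatim}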