Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases from visual question generation models or adversarial perturbations. These approaches use the combined data to learn an answer classifier by minimizing the standard cross-entropy loss. To more effectively leverage augmented data, we build on the recent success in contrastive learning. We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses. The contrastive loss encourages representations to be robust to linguistic variations in questions while the cross-entropy loss preserves the discriminative power of representations for answer prediction. We find that optimizing both losses -- either alternately or jointly -- is key to effective training. On the VQA-Rephrasings benchmark, which measures a VQA model's answer consistency across human paraphrases of a question, ConClaT improves Consensus Score by 1.63% over an improved baseline. In addition, on the standard VQA 2.0 benchmark, we improve VQA accuracy by 0.78% overall. We also show that ConClaT is agnostic to the type of data-augmentation strategy used.
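To make the dual-objective idea concrete, the following is a minimal PyTorch-style sketch of jointly optimizing an answer cross-entropy loss and an InfoNCE-style contrastive loss over question paraphrase pairs. The model interface (`model(image, question)` returning answer logits and a fused representation), the helper names, `lambda_c`, and the temperature are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: joint optimization of answer cross-entropy and a
# contrastive (InfoNCE-style) loss that pulls a question and its
# paraphrase together while pushing apart other questions in the batch.
# All names and the model interface below are assumptions for illustration.
import torch
import torch.nn.functional as F


def contrastive_loss(z_orig, z_para, temperature=0.1):
    """Treat (question, paraphrase) pairs as positives and all other
    in-batch pairings as negatives."""
    z_orig = F.normalize(z_orig, dim=1)
    z_para = F.normalize(z_para, dim=1)
    logits = z_orig @ z_para.t() / temperature            # (B, B) similarities
    targets = torch.arange(z_orig.size(0), device=z_orig.device)
    return F.cross_entropy(logits, targets)


def joint_step(model, optimizer, batch, lambda_c=1.0):
    """One joint-training step: answer cross-entropy + contrastive term."""
    # Assumed interface: model returns answer logits and a fused
    # question-image representation for each input question.
    logits_o, rep_o = model(batch["image"], batch["question"])
    logits_p, rep_p = model(batch["image"], batch["paraphrase"])

    ce = F.cross_entropy(logits_o, batch["answer"]) + \
         F.cross_entropy(logits_p, batch["answer"])
    con = contrastive_loss(rep_o, rep_p)

    loss = ce + lambda_c * con                             # joint objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The alternating variant mentioned in the abstract would simply back-propagate `ce` and `con` on separate steps instead of summing them; the sketch above shows only the joint form.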