Today's VQA models still tend to capture superficial linguistic correlations in the training set and fail to generalize to test sets with different QA distributions. To reduce these language biases, recent VQA works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, achieving dominant performance on diagnostic benchmarks for out-of-distribution (OOD) testing. However, due to their complex model design, these ensemble-based methods are unable to equip themselves with two indispensable characteristics of an ideal VQA model: 1) Visual-explainable: the model should rely on the right visual regions when making decisions. 2) Question-sensitive: the model should be sensitive to linguistic variations in questions. To this end, we propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy. After training with CSST, VQA models are forced to focus on all critical objects and words, which significantly improves both visual-explainable and question-sensitive abilities. Specifically, CSST is composed of two parts: Counterfactual Samples Synthesizing (CSS) and Counterfactual Samples Training (CST). CSS generates counterfactual samples by carefully masking critical objects in images or words in questions and assigning pseudo ground-truth answers. CST not only trains the VQA models with both complementary samples to predict their respective ground-truth answers, but also urges the VQA models to further distinguish the original samples from superficially similar counterfactual ones. To facilitate CST training, we propose two variants of supervised contrastive loss for VQA, and design an effective positive and negative sample selection mechanism based on CSS. Extensive experiments have shown the effectiveness of CSST. In particular, by building on top of the LMH+SAR model, we achieve record-breaking performance on all OOD benchmarks.
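To make the CSS/CST mechanism described above concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): a CSS-style counterfactual is synthesized by masking critical words in a question, and a generic supervised contrastive loss pulls the anchor representation toward positives (e.g. the original sample) and away from negatives (e.g. its counterfactual counterparts). All function names, the `[MASK]` token, and the embedding setup here are illustrative assumptions.

```python
import numpy as np

def synthesize_counterfactual_question(tokens, critical_idx, mask_token="[MASK]"):
    """Toy CSS step: mask critical words in a question to build a
    counterfactual sample. (Illustrative only; the paper identifies
    critical objects/words via their contribution to the answer.)"""
    return [mask_token if i in critical_idx else t for i, t in enumerate(tokens)]

def supervised_contrastive_loss(anchor, positives, negatives, tau=0.07):
    """Generic supervised contrastive loss over embeddings: the anchor
    is attracted to each positive and repelled from the negatives.
    Positives/negatives would be chosen via a CSS-based selection rule."""
    def sim(a, b):  # cosine similarity
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos_exp = [np.exp(sim(anchor, p) / tau) for p in positives]
    neg_exp = [np.exp(sim(anchor, n) / tau) for n in negatives]
    denom = sum(pos_exp) + sum(neg_exp)
    # Average the -log likelihood of each positive against all candidates.
    return -float(np.mean([np.log(p / denom) for p in pos_exp]))

# Example: masking critical words of a question.
q = ["what", "color", "is", "the", "banana"]
cf_q = synthesize_counterfactual_question(q, critical_idx={1, 4})
print(cf_q)  # ['what', '[MASK]', 'is', 'the', '[MASK]']

# Example: a well-separated embedding yields a lower loss than a confused one.
anchor = np.array([1.0, 0.0])
good = supervised_contrastive_loss(anchor, [np.array([1.0, 0.0])],
                                   [np.array([0.0, 1.0])])
bad = supervised_contrastive_loss(anchor, [np.array([0.0, 1.0])],
                                  [np.array([1.0, 0.0])])
print(good < bad)  # True
```

The toy loss mirrors the role CST plays in the paper: original and counterfactual samples are trained toward their respective (pseudo) ground-truth answers, while the contrastive term forces the model to tell superficially similar pairs apart.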