The task of Visual Question Answering (VQA) is known to suffer from VQA models exploiting biases in the training dataset to make their final predictions. Various ensemble-based debiasing methods have been proposed in which an additional model is purposefully trained to be biased in order to train a robust target model. However, these methods compute the bias of a model simply from the label statistics of the training data or from single-modal branches. In this work, in order to better learn the bias that a target VQA model suffers from, we propose a generative method, called GenB, that trains the bias model directly from the target model. In particular, GenB employs a generative network to learn the bias in the target model through a combination of an adversarial objective and knowledge distillation. We then debias our target model using GenB as the bias model, demonstrate through extensive experiments the effectiveness of our method on various VQA bias datasets, including VQA-CP2, VQA-CP1, GQA-OOD, and VQA-CE, and achieve state-of-the-art results with the LXMERT architecture on VQA-CP2.
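To make the described pipeline concrete, below is a minimal PyTorch sketch of the two training signals the abstract mentions: a bias model trained against the target model with an adversarial objective plus knowledge distillation, and an ensemble-style debiasing loss for the target model. The abstract does not give architectures or exact loss definitions, so every module, dimension, and weighting here (`BiasGenerator`, the discriminator, the reweighting scheme) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the GenB idea, assuming a standard multi-label VQA setup.
# All module names, sizes, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ANSWERS, Q_DIM, Z_DIM = 3129, 512, 64  # assumed answer-vocabulary and embedding sizes

class BiasGenerator(nn.Module):
    """Generative bias model: question embedding + random noise -> answer logits.
    Seeing only the question (no image) restricts it to question-side bias."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Q_DIM + Z_DIM, 1024), nn.ReLU(),
                                 nn.Linear(1024, N_ANSWERS))

    def forward(self, q_emb):
        z = torch.randn(q_emb.size(0), Z_DIM, device=q_emb.device)
        return self.net(torch.cat([q_emb, z], dim=1))

# Discriminator over answer distributions, used by the adversarial objective:
# it is trained (not shown) to label the target model's outputs as real and
# the bias model's outputs as fake.
disc = nn.Sequential(nn.Linear(N_ANSWERS, 256), nn.ReLU(), nn.Linear(256, 1))

def bias_model_loss(bias_logits, target_logits):
    """Train the bias model to imitate the (frozen) target model: an
    adversarial 'fool the discriminator' term plus KL-based distillation."""
    real = torch.ones(bias_logits.size(0), 1, device=bias_logits.device)
    adv = F.binary_cross_entropy_with_logits(disc(torch.sigmoid(bias_logits)), real)
    kd = F.kl_div(F.log_softmax(bias_logits, dim=1),
                  F.softmax(target_logits.detach(), dim=1), reduction="batchmean")
    return adv + kd

def debiased_target_loss(target_logits, bias_logits, labels):
    """Generic ensemble-style debiasing: down-weight answers the bias model is
    already confident about, pushing the target model to rely on the image."""
    per_answer = F.binary_cross_entropy_with_logits(target_logits, labels,
                                                    reduction="none")
    with torch.no_grad():
        weight = 1.0 - torch.sigmoid(bias_logits)  # low weight where bias is confident
    return (weight * per_answer).sum(dim=1).mean()
```

In a full training loop one would alternate between updating the bias model and discriminator with `bias_model_loss` and the usual real/fake objective, and updating the target model with `debiased_target_loss`; the exact debiasing formulation in the paper may differ from this generic reweighting.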