Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle with domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated so that their effects can be studied independently. Four factors are considered: visual complexity, question redundancy, concept distribution, and concept compositionality. With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data along each of these axes. We study four existing methods, including two neural-symbolic methods (NSCL and NSVQA) and two non-symbolic methods (FiLM and mDETR), as well as our proposed method, probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms the other methods on three of the four domain shift factors. Our results suggest that disentangling reasoning from perception, combined with probabilistic uncertainty, forms a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR.