Generalizing beyond training experience plays a significant role in developing practical AI systems. Current Visual Question Answering (VQA) models have been shown to over-rely on language priors (spurious correlations between question types and their most frequent answers) in the training set, and consequently to perform poorly on Out-of-Distribution (OOD) test sets. This behavior limits their generalizability and restricts their use in real-world settings. This paper shows that the sequence model architecture used in the question encoder plays a significant role in the generalizability of VQA models. To demonstrate this, we perform a detailed analysis of various existing RNN-based and Transformer-based question encoders and, alongside them, propose a novel Graph Attention Network (GAT)-based question encoder. Our study finds that a better choice of sequence model in the question encoder improves the generalizability of VQA models, even without any additional, relatively complex bias-mitigation approaches.