Knowledge-based Visual Question Answering (VQA) requires models to draw on external knowledge for robust answer prediction. Despite the significance of this task, this paper identifies several key factors impeding the advancement of current state-of-the-art methods. On the one hand, methods that exploit explicit knowledge treat the knowledge as a complement to a coarsely trained VQA model; despite their effectiveness, these approaches often suffer from noise incorporation and error propagation. On the other hand, with regard to implicit knowledge, the multi-modal implicit knowledge for knowledge-based VQA remains largely unexplored. This work presents a unified end-to-end retriever-reader framework for knowledge-based VQA. In particular, we shed light on the multi-modal implicit knowledge embedded in vision-language pre-training models and mine its potential for knowledge reasoning. To address the noise introduced by retrieval over explicit knowledge, we design a novel scheme that creates pseudo labels for effective knowledge supervision. This scheme not only provides guidance for knowledge retrieval, but also drops instances that are potentially error-prone for question answering. To validate the effectiveness of the proposed method, we conduct extensive experiments on the benchmark dataset. The experimental results reveal that our method outperforms existing baselines by a noticeable margin. Beyond the reported numbers, this paper further offers several insights and empirical findings on knowledge utilization for future research.