The knowledge-based visual question answering (KVQA) task aims to answer questions that require external knowledge in addition to an understanding of the image and the question. Recent studies on KVQA inject external knowledge in a multi-modal form, but as more knowledge is used, irrelevant information may be added and confuse question answering. To use knowledge properly, this study proposes the following: 1) we introduce a novel semantic inconsistency measure computed from caption uncertainty and semantic similarity; 2) we suggest a new external knowledge assimilation method based on the semantic inconsistency measure and apply it to integrate explicit and implicit knowledge for KVQA; 3) the proposed method is evaluated on the OK-VQA dataset and achieves state-of-the-art performance.
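To make the first contribution concrete, below is a minimal sketch of how a semantic inconsistency score could combine the two signals named in the abstract. The abstract does not specify the exact formulation, so the choices here are assumptions for illustration only: caption uncertainty is approximated by the mean negative log-probability of the generated caption tokens, semantic similarity by the cosine similarity between caption and question embeddings, and the two are blended with a hypothetical weight `alpha`.

```python
import torch
import torch.nn.functional as F


def semantic_inconsistency(caption_token_logprobs: torch.Tensor,
                           caption_embedding: torch.Tensor,
                           question_embedding: torch.Tensor,
                           alpha: float = 0.5) -> torch.Tensor:
    """Illustrative semantic inconsistency score (not the paper's exact formula).

    Combines (1) caption uncertainty, estimated as the mean negative
    log-probability of the generated caption tokens, with (2) semantic
    dissimilarity, taken as one minus the cosine similarity between the
    caption and question embeddings. Both estimators and the weight
    `alpha` are assumptions made for this sketch.
    """
    # Caption uncertainty: a higher mean NLL means a less confident caption.
    uncertainty = -caption_token_logprobs.mean()

    # Semantic dissimilarity between caption and question representations.
    dissimilarity = 1.0 - F.cosine_similarity(
        caption_embedding.unsqueeze(0), question_embedding.unsqueeze(0)
    ).squeeze(0)

    # Blend the two signals into a single inconsistency score.
    return alpha * uncertainty + (1.0 - alpha) * dissimilarity


# Example usage with random placeholder tensors (embedding size 512 assumed).
if __name__ == "__main__":
    logprobs = torch.log(torch.rand(12))          # per-token log-probabilities
    cap_emb = torch.randn(512)                    # caption embedding
    q_emb = torch.randn(512)                      # question embedding
    score = semantic_inconsistency(logprobs, cap_emb, q_emb)
    print(f"inconsistency score: {score.item():.4f}")
```

Under this reading, a high score flags image captions that are either uncertainly generated or semantically far from the question, which is the kind of signal the proposed assimilation method could use to down-weight irrelevant external knowledge.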