Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge given the visual question and then predicts the answer based on the retrieved content. However, the retrieved knowledge is often inadequate: retrievals are frequently too general to cover the specific knowledge needed to answer the question. Moreover, the naturally available supervision (whether the passage contains the correct answer) is weak and does not guarantee question relevance. To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge. Experiments show that our EnFoRe model achieves superior retrieval performance on OK-VQA, currently the largest outside-knowledge VQA dataset. We also combine the retrieved knowledge with state-of-the-art VQA models and achieve new state-of-the-art performance on OK-VQA.