Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge to correctly answer a text question about an associated image. Recent work on single-modality text tasks has shown that knowledge injection into pre-trained language models, specifically via entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks. In this work, we empirically study whether and how such methods, applied in a bi-modal setting, can improve an existing VQA system's performance on the KBVQA task. We experiment with two large publicly available VQA datasets: (1) KVQA, which contains mostly rare Wikipedia entities, and (2) OKVQA, which is less entity-centric and more aligned with common-sense reasoning. Both lack explicit entity spans, and we study the effect of different weakly supervised and manual methods for obtaining them. Additionally, we analyze how recently proposed bi-modal and single-modal attention explanations are affected by the incorporation of such entity-enhanced representations. Our results show substantially improved performance on the KBVQA task without the need for additional costly pre-training, and we provide insights into when entity knowledge injection helps improve a model's understanding. We provide code and enhanced datasets for reproducibility.