Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA system to include facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e., a fact that is absent from the KG is not necessarily false. Being able to reason over incomplete KGs for QA is therefore a critical requirement in real-world applications, yet it has not been addressed extensively in the literature. We develop a novel QA architecture that can reason over incomplete KGs, a capability that current FVQA state-of-the-art (SOTA) approaches lack. We use KG embeddings, a technique widely used for KG completion, for the downstream task of FVQA. To enable this capability, we also employ a new image representation technique that we call "Image-as-Knowledge", alongside a simple one-step co-attention mechanism that attends to text and image during QA. Our FVQA architecture is faster at inference time: it is O(m), whereas existing FVQA SOTA methods are O(N log N), where m is the number of vertices and N is the number of edges (which is O(m^2)). We observe that our architecture performs comparably to existing methods on the standard answer-retrieval baseline, while for missing-edge reasoning our KG representation outperforms the SOTA representation by 25% and our image representation outperforms the SOTA representation by 2.6%.
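To illustrate why retrieval over entity embeddings can cost O(m), the following is a minimal sketch and not the architecture described above. It assumes the model has already produced a query vector from the image and question, and that answers are retrieved by cosine similarity against one embedding per KG vertex. The scoring function, the names `cosine` and `retrieve_answer`, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_answer(query: Sequence[float],
                    entity_embeddings: Dict[str, List[float]]) -> str:
    """Return the entity whose embedding is most similar to the query.

    A single pass over the m entities: O(m) similarity evaluations,
    independent of the number of edges N in the KG.
    """
    if not entity_embeddings:
        raise ValueError("entity_embeddings must not be empty")
    return max(entity_embeddings,
               key=lambda name: cosine(query, entity_embeddings[name]))


# Hypothetical toy embeddings; illustrative values, not taken from the paper.
TOY_ENTITIES = {
    "dog": [0.9, 0.1, 0.0],
    "frisbee": [0.1, 0.9, 0.1],
    "park": [0.0, 0.2, 0.9],
}

# Assumed query vector, standing in for the output of the QA model.
print(retrieve_answer([0.2, 0.8, 0.0], TOY_ENTITIES))
```

Because the lookup compares the query against vertex embeddings rather than traversing stored edges, an answer can in principle be retrieved even when the edge linking it to the question's entities is missing from the KG.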