The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in various forms, including visual, textual, and commonsense knowledge. Using more knowledge sources increases the chance of retrieving more irrelevant or noisy facts, making it challenging to comprehend the facts and find the answer. To address this challenge, we propose Multi-modal Answer Validation using External knowledge (MAVEx), where the idea is to validate a set of promising answer candidates based on answer-specific knowledge retrieval. Instead of searching for the answer in a vast collection of often irrelevant facts as most existing approaches do, MAVEx aims to learn how to extract relevant knowledge from noisy sources, which knowledge source to trust for each answer candidate, and how to validate the candidate using that source. Our multi-modal setting is the first to leverage external visual knowledge (images searched using Google), in addition to textual knowledge in the form of Wikipedia sentences and ConceptNet concepts. Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results. Our code is available at https://github.com/jialinwu17/MAVEX
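To make the answer-validation idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation): each answer candidate retrieves its own evidence from several knowledge sources, a toy scoring function stands in for the learned validation model, and per-source trust weights combine the scores. All names (Candidate, support_score, validate, source_trust) are illustrative assumptions.

```python
# Hypothetical sketch of MAVEx-style answer validation (not the authors' code).
# The real model uses learned neural encoders and retrieval over Wikipedia,
# ConceptNet, and Google-searched images; here simple word overlap stands in.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    answer: str
    # Evidence retrieved specifically for this candidate, keyed by source name.
    evidence: Dict[str, List[str]]


def support_score(answer: str, facts: List[str]) -> float:
    """Toy stand-in for a learned validator: fraction of retrieved facts
    that share at least one word with the candidate answer."""
    tokens = set(answer.lower().split())
    if not facts:
        return 0.0
    hits = sum(1 for f in facts if tokens & set(f.lower().split()))
    return hits / len(facts)


def validate(candidates: List[Candidate], source_trust: Dict[str, float]) -> str:
    """Score each candidate against its own evidence, weighting every
    knowledge source by a trust value, and return the best answer."""
    def total(c: Candidate) -> float:
        return sum(source_trust.get(src, 0.0) * support_score(c.answer, facts)
                   for src, facts in c.evidence.items())
    return max(candidates, key=total).answer


if __name__ == "__main__":
    candidates = [
        Candidate("surfing", {"wikipedia": ["A surfboard is used for surfing."],
                              "conceptnet": ["surfboard is used for surfing"],
                              "images": ["caption: person surfing on a board"]}),
        Candidate("skiing", {"wikipedia": ["Skis are used on snow."],
                             "conceptnet": [],
                             "images": []}),
    ]
    # In the actual model, trust weights are learned rather than fixed.
    print(validate(candidates, {"wikipedia": 0.4, "conceptnet": 0.3, "images": 0.3}))
```

In this sketch the key design choice mirrors the abstract: evidence is retrieved per answer candidate rather than once per question, so each candidate is validated against knowledge that is specific to it.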