Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question, restricting the performance of their models. Recent works have sought to use a large language model (i.e., GPT-3) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of GPT-3, as the provided input information is insufficient. In this paper, we present Prophet -- a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are encoded into the prompts to enable GPT-3 to better comprehend the task, thus enhancing its capacity. Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively.
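The pipeline above can be sketched as a prompt-construction step: the vanilla VQA model supplies scored answer candidates and selects answer-aware in-context examples, and both heuristics are serialized into the GPT-3 prompt. The following is a minimal illustration with hypothetical field names and prompt wording; the paper's actual prompt format, caption model, and example-selection strategy are not shown here.

```python
def build_prompt(examples, candidates, context, question):
    """Assemble a few-shot prompt encoding both answer heuristics:
    answer-aware in-context examples and scored answer candidates.
    (Hypothetical format; the real Prophet prompt may differ.)"""
    lines = ["Please answer the question according to the context and the candidates.", ""]
    # Answer-aware examples: training samples whose latent answers are
    # similar to the test input, as judged by the vanilla VQA model.
    for ex in examples:
        lines.append(f"Context: {ex['context']}")
        lines.append(f"Question: {ex['question']}")
        cands = ", ".join(f"{a} ({s:.2f})" for a, s in ex["candidates"])
        lines.append(f"Candidates: {cands}")
        lines.append(f"Answer: {ex['answer']}")
        lines.append("")
    # The test input, formatted identically but with the answer left blank
    # so GPT-3 completes it.
    lines.append(f"Context: {context}")
    lines.append(f"Question: {question}")
    cands = ", ".join(f"{a} ({s:.2f})" for a, s in candidates)
    lines.append(f"Candidates: {cands}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The candidates carry the VQA model's confidence scores, which lets GPT-3 treat them as soft hints rather than a hard answer list: it may pick a candidate or generate a different answer when none fits.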