Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and the question for answer prediction. However, this two-step approach can lead to mismatches that limit VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features used during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3's power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by providing only a few in-context VQA examples. We further boost performance by carefully investigating (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. Using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where it shows decent few-shot performance.
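To make the prompting pipeline concrete, below is a minimal sketch of how a PICa-style few-shot prompt could be assembled from image captions and in-context QA pairs. The Context/Q/A template, the build_prompt helper, and the toy examples are illustrative assumptions rather than the paper's released code; the resulting string would be sent to GPT-3's text completion endpoint.

```python
# Minimal sketch of a PICa-style few-shot prompt, assuming a simple
# Context / Q / A template. The template, helper name, and examples
# below are illustrative assumptions, not the paper's actual code.

def build_prompt(in_context_examples, test_caption, test_question):
    """Assemble a few-shot VQA prompt from captions and QA pairs.

    in_context_examples: list of (caption, question, answer) triples,
        e.g. the 16 examples used in the paper's best setting.
    """
    header = "Please answer the question according to the context.\n"
    blocks = []
    for caption, question, answer in in_context_examples:
        blocks.append(f"Context: {caption}\nQ: {question}\nA: {answer}\n")
    # The test example ends with "A:" so the model completes the answer.
    blocks.append(f"Context: {test_caption}\nQ: {test_question}\nA:")
    return header + "\n".join(blocks)


if __name__ == "__main__":
    examples = [
        ("A man riding a surfboard on a wave.",
         "What sport is this?", "surfing"),
    ]
    prompt = build_prompt(
        examples,
        "A red double-decker bus on a city street.",
        "What country is this likely in?",
    )
    print(prompt)  # this string would be sent to GPT-3 for completion
```

Under this framing, the caption (or tag list) stands in for the image, so the choice of captioning model and the selection of in-context examples, the two factors the abstract highlights, directly determine prompt quality.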