Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to answer correctly. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable LMs to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, it is often underspecified which visual entities to describe. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Unlike generic captions, PromptCap takes a natural-language prompt to control which visual entities to describe in the generated caption. The prompt contains a question that the caption should aid in answering. To avoid extra annotation, PromptCap is trained on examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness in an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.
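To make the caption-then-ask pipeline concrete, below is a minimal Python sketch of the setup described above. The function names, prompt templates, and the canned caption are illustrative assumptions for exposition, not the paper's actual code or model API.

```python
def promptcap_caption(image_path: str, question: str) -> str:
    """Stand-in for the PromptCap model (hypothetical interface).

    The real model conditions caption generation on both the image and a
    natural-language prompt containing the question, so the caption covers
    the visual entities the question asks about.
    """
    # Canned output for illustration only; a real call would run the model.
    return "A young boy is putting a baseball glove on his left hand."


def build_lm_prompt(caption: str, question: str) -> str:
    """Assemble the text prompt sent to a black-box LM such as GPT-3.

    The exact template is an assumption; the pipeline similarly feeds the
    caption and question to the LM as in-context text.
    """
    return (
        "Answer the question based on the image description.\n"
        f"Description: {caption}\n"
        f"Question: {question}\n"
        "Answer:"
    )


question = "What sport is the boy getting ready for?"
caption = promptcap_caption("boy.jpg", question)
print(build_lm_prompt(caption, question))
# The LM's completion (e.g., "baseball") is taken as the VQA answer.
```

Because the question steers the caption, the detail the LM needs ("baseball glove") appears in the description, whereas a generic caption such as "a boy standing outside" would leave the LM without the relevant evidence.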