PromptCap: 基于提示的图像描述与GPT-3的视觉问答 (PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3) - 专知论文

会员服务 ·

0

图像字幕 · 视觉问答 · GPT-3 · Prompt · entity ·

2023 年 3 月 21 日

PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3

翻译：PromptCap: 基于提示的图像描述与GPT-3的视觉问答

Yushi Hu,Hang Hua,Zhengyuan Yang,Weijia Shi,Noah A. Smith,Jiebo Luo

Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable LM to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, which visual entities to describe are often underspecified. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Different from generic captions, PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption. The prompt contains a question that the caption should aid in answering. To avoid extra annotation, PromptCap is trained by examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.

翻译：知识驱动的视觉问答涉及需要超出图像以获得正确答案的世界知识的问题。像GPT-3这样的大型语言模型在这项任务中特别有帮助，因为它们具有强大的知识检索和推理能力。为了使LM理解图片，先前的工作使用一种字幕模型将图像转换为文本。然而，在单个字幕句子中总结图像时，要描述哪些视觉实体经常不足。通用图像字幕经常会错过LM回答视觉问题所必需的视觉细节。为了解决这个挑战，我们提出了PromptCap（Prompt-guided image Captioning），一种字幕模型，旨在成为图像和黑匣子LM之间更好的连接器。与通用字幕不同，PromptCap采用自然语言提示来控制生成的字幕中要描述的视觉实体。提示包含一个问题，字幕应该有助于回答这个问题。为了避免额外的注释，PromptCap使用GPT-3和现有数据集合成的示例进行训练。我们展示了PromptCap在现有管道中的有效性，在该管道中，GPT-3通过图像字幕提示进行VQA。 PromptCap明显优于通用字幕，并在基于知识的VQA任务上实现了最先进的准确性（OK-VQA为60.4％，A-OKVQA为59.6％）。 WebQA上的零-shot结果表明，PromptCap可以很好地推广到未见过的领域。

0

相关内容

图像字幕

图像字幕（Image Captioning）,是指从图像生成文本描述的过程，主要根据图像中物体和物体的动作。

CVPR 2023 | Prophet: 用小模型启发大语言模型解决外部知识图像问答

CVPR 2023 | Prophet: 用小模型启发大语言模型解决外部知识图像问答

专知会员服务

54+阅读 · 2023年4月1日

【CVPR 2022】视觉提示调整（VPT），Vision Prompt Tuning

【CVPR 2022】视觉提示调整（VPT），Vision Prompt Tuning

专知会员服务

32+阅读 · 2022年3月12日

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

专知会员服务

27+阅读 · 2022年3月3日

【TPAMI2022】从展示到讲述: 基于深度学习的图像描述研究综述论文，From Show to Tell: A Survey on Deep Learning-based Image Captioning

【TPAMI2022】从展示到讲述: 基于深度学习的图像描述研究综述论文，From Show to Tell: A Survey on Deep Learning-based Image Captioning

专知会员服务

24+阅读 · 2022年3月1日

【ACL2022-华盛顿大学】生成知识促进常识推理，Generated Knowledge Prompting for Commonsense Reasoning

【ACL2022-华盛顿大学】生成知识促进常识推理，Generated Knowledge Prompting for Commonsense Reasoning

专知会员服务

26+阅读 · 2022年3月1日

最新《图像描述Image Captioning》综述论文，22页pdf220篇文献

专知会员服务

43+阅读 · 2021年7月17日

近期必读的七篇AAAI 2021【问答（QA）】相关论文和代码

专知会员服务

55+阅读 · 2021年2月2日

1750亿参数！GPT-3来了！31位作者，OpenAI发布小样本学习器语言模型

1750亿参数！GPT-3来了！31位作者，OpenAI发布小样本学习器语言模型

专知会员服务

73+阅读 · 2020年5月30日

【CVPR2020-英伟达】从图像集合中学习自监督视点，Self-Supervised Viewpoint Learning From Image Collections

【CVPR2020-英伟达】从图像集合中学习自监督视点，Self-Supervised Viewpoint Learning From Image Collections

专知会员服务

24+阅读 · 2020年4月4日

【NLP| 推荐文章】基于知识库的问答系统关键技术综述（Core techniques of question answering systems over knowledge bases：a survey）

专知会员服务

47+阅读 · 2019年11月24日

IJCAI 2022 | 使用陈述句进行视觉问答的Prompt Tuning

IJCAI 2022 | 使用陈述句进行视觉问答的Prompt Tuning

PaperWeekly

3+阅读 · 2022年9月21日

NAACL 2022 | 基于Prompt的文本生成迁移学习

NAACL 2022 | 基于Prompt的文本生成迁移学习

PaperWeekly

1+阅读 · 2022年8月31日

论文浅尝 | KM-BART：用于视觉常识生成的知识增强多模态BART

论文浅尝 | KM-BART：用于视觉常识生成的知识增强多模态BART

开放知识图谱

0+阅读 · 2022年5月29日

【CV+NLP】更有智慧的眼睛：图像描述（Image Caption）&视觉问答（VQA）综述（上）

【CV+NLP】更有智慧的眼睛：图像描述（Image Caption）&视觉问答（VQA）综述（上）

极市平台

79+阅读 · 2019年1月20日

【泡泡一分钟】DS-SLAM: 动态环境下的语义视觉SLAM

【泡泡一分钟】DS-SLAM: 动态环境下的语义视觉SLAM

泡泡机器人SLAM

23+阅读 · 2019年1月18日

Image Captioning 36页最新综述， 161篇参考文献

Image Captioning 36页最新综述， 161篇参考文献

专知

90+阅读 · 2018年10月23日

【论文推荐】最新八篇图像描述生成相关论文—比较级对抗学习、正则化RNNs、深层网络、视觉对话、婴儿说话、自我检索

【论文推荐】最新八篇图像描述生成相关论文—比较级对抗学习、正则化RNNs、深层网络、视觉对话、婴儿说话、自我检索

专知

10+阅读 · 2018年4月12日

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

专知

15+阅读 · 2018年2月3日

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

专知

66+阅读 · 2018年1月31日

【干货】基于属性学习和额外知识库的图像描述生成和视觉问答

【干货】基于属性学习和额外知识库的图像描述生成和视觉问答

专知

18+阅读 · 2017年12月25日

基于天然产物Aspernigerin的新型几丁质合成抑制剂的设计、合成及生物活性研究

国家自然科学基金

0+阅读 · 2014年12月31日

大规模汉语历时语料库建设及词汇语义变迁研究

国家自然科学基金

1+阅读 · 2014年12月31日

中心反折射全景相机标定- - 共形几何代数方法

国家自然科学基金

0+阅读 · 2013年12月31日

基于Ontology的藏文语料库检索关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于不饱和腈合成子的萘啶骨架定向构筑

国家自然科学基金

0+阅读 · 2012年12月31日

多波束测深声纳孔径合成机理与三维高分辨探测技术

国家自然科学基金

0+阅读 · 2012年12月31日

风格化人体运动合成新方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于语言理解的机器翻译方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

基于边缘点的折反射图像立体匹配与三维重建研究

国家自然科学基金

0+阅读 · 2009年12月31日

应用于面向问题的自动文摘任务的篇章分析关键技术研究

国家自然科学基金

0+阅读 · 2008年12月31日

Privacy-Preserving Prompt Tuning for Large Language Model Services

Arxiv

0+阅读 · 2023年5月10日

Large Language Models Need Holistically Thought in Medical Conversational QA

Arxiv

0+阅读 · 2023年5月10日

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

Arxiv

0+阅读 · 2023年5月9日

Question-controlled Text-aware Image Captioning

Arxiv

10+阅读 · 2021年8月4日

From Show to Tell: A Survey on Image Captioning

Arxiv

15+阅读 · 2021年7月14日

Exploring Visual Relationship for Image Captioning

Exploring Visual Relationship for Image Captioning

Arxiv

15+阅读 · 2018年9月19日

CNN+CNN: Convolutional Decoders for Image Captioning

Arxiv

21+阅读 · 2018年5月23日

Image Captioning

Arxiv

11+阅读 · 2018年5月13日

End-to-End Dense Video Captioning with Masked Transformer

Arxiv

14+阅读 · 2018年4月3日

DeepSeek: Content Based Image Search & Retrieval

Arxiv

13+阅读 · 2018年1月11日

VIP会员

文章信息

相关主题

相关VIP内容

CVPR 2023 | Prophet: 用小模型启发大语言模型解决外部知识图像问答

CVPR 2023 | Prophet: 用小模型启发大语言模型解决外部知识图像问答

专知会员服务

54+阅读 · 2023年4月1日

【CVPR 2022】视觉提示调整（VPT），Vision Prompt Tuning

【CVPR 2022】视觉提示调整（VPT），Vision Prompt Tuning

专知会员服务

32+阅读 · 2022年3月12日

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

专知会员服务

27+阅读 · 2022年3月3日

【TPAMI2022】从展示到讲述: 基于深度学习的图像描述研究综述论文，From Show to Tell: A Survey on Deep Learning-based Image Captioning

【TPAMI2022】从展示到讲述: 基于深度学习的图像描述研究综述论文，From Show to Tell: A Survey on Deep Learning-based Image Captioning

专知会员服务

24+阅读 · 2022年3月1日

【ACL2022-华盛顿大学】生成知识促进常识推理，Generated Knowledge Prompting for Commonsense Reasoning

【ACL2022-华盛顿大学】生成知识促进常识推理，Generated Knowledge Prompting for Commonsense Reasoning

专知会员服务

26+阅读 · 2022年3月1日

最新《图像描述Image Captioning》综述论文，22页pdf220篇文献

专知会员服务

43+阅读 · 2021年7月17日

近期必读的七篇AAAI 2021【问答（QA）】相关论文和代码

专知会员服务

55+阅读 · 2021年2月2日

1750亿参数！GPT-3来了！31位作者，OpenAI发布小样本学习器语言模型

1750亿参数！GPT-3来了！31位作者，OpenAI发布小样本学习器语言模型

专知会员服务

73+阅读 · 2020年5月30日

【CVPR2020-英伟达】从图像集合中学习自监督视点，Self-Supervised Viewpoint Learning From Image Collections

【CVPR2020-英伟达】从图像集合中学习自监督视点，Self-Supervised Viewpoint Learning From Image Collections

专知会员服务

24+阅读 · 2020年4月4日

【NLP| 推荐文章】基于知识库的问答系统关键技术综述（Core techniques of question answering systems over knowledge bases：a survey）

专知会员服务

47+阅读 · 2019年11月24日

热门VIP内容

开通专知VIP会员享更多权益服务

《复杂工程系统模型驱动设计决策支持系统：早期设计阶段挑战》最新138页

《日本陆上自卫队2040年作战方式与未来作战研究》最新23页slides

人工智能作为战争武器

《后勤保障》最新23页

相关资讯

IJCAI 2022 | 使用陈述句进行视觉问答的Prompt Tuning

IJCAI 2022 | 使用陈述句进行视觉问答的Prompt Tuning

PaperWeekly

3+阅读 · 2022年9月21日

NAACL 2022 | 基于Prompt的文本生成迁移学习

NAACL 2022 | 基于Prompt的文本生成迁移学习

PaperWeekly

1+阅读 · 2022年8月31日

论文浅尝 | KM-BART：用于视觉常识生成的知识增强多模态BART

论文浅尝 | KM-BART：用于视觉常识生成的知识增强多模态BART

开放知识图谱

0+阅读 · 2022年5月29日

【CV+NLP】更有智慧的眼睛：图像描述（Image Caption）&视觉问答（VQA）综述（上）

【CV+NLP】更有智慧的眼睛：图像描述（Image Caption）&视觉问答（VQA）综述（上）

极市平台

79+阅读 · 2019年1月20日

【泡泡一分钟】DS-SLAM: 动态环境下的语义视觉SLAM

【泡泡一分钟】DS-SLAM: 动态环境下的语义视觉SLAM

泡泡机器人SLAM

23+阅读 · 2019年1月18日

Image Captioning 36页最新综述， 161篇参考文献

Image Captioning 36页最新综述， 161篇参考文献

专知

90+阅读 · 2018年10月23日

【论文推荐】最新八篇图像描述生成相关论文—比较级对抗学习、正则化RNNs、深层网络、视觉对话、婴儿说话、自我检索

【论文推荐】最新八篇图像描述生成相关论文—比较级对抗学习、正则化RNNs、深层网络、视觉对话、婴儿说话、自我检索

专知

10+阅读 · 2018年4月12日

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

专知

15+阅读 · 2018年2月3日

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

专知

66+阅读 · 2018年1月31日

【干货】基于属性学习和额外知识库的图像描述生成和视觉问答

【干货】基于属性学习和额外知识库的图像描述生成和视觉问答

专知

18+阅读 · 2017年12月25日

相关论文

Privacy-Preserving Prompt Tuning for Large Language Model Services

Arxiv

0+阅读 · 2023年5月10日

Large Language Models Need Holistically Thought in Medical Conversational QA

Arxiv

0+阅读 · 2023年5月10日

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

Arxiv

0+阅读 · 2023年5月9日

Question-controlled Text-aware Image Captioning

Arxiv

10+阅读 · 2021年8月4日

From Show to Tell: A Survey on Image Captioning

Arxiv

15+阅读 · 2021年7月14日

Exploring Visual Relationship for Image Captioning

Exploring Visual Relationship for Image Captioning

Arxiv

15+阅读 · 2018年9月19日

CNN+CNN: Convolutional Decoders for Image Captioning

Arxiv

21+阅读 · 2018年5月23日

Image Captioning

Arxiv

11+阅读 · 2018年5月13日

End-to-End Dense Video Captioning with Masked Transformer

Arxiv

14+阅读 · 2018年4月3日

DeepSeek: Content Based Image Search & Retrieval

Arxiv

13+阅读 · 2018年1月11日

相关基金

基于天然产物Aspernigerin的新型几丁质合成抑制剂的设计、合成及生物活性研究

国家自然科学基金

0+阅读 · 2014年12月31日

大规模汉语历时语料库建设及词汇语义变迁研究

国家自然科学基金

1+阅读 · 2014年12月31日

中心反折射全景相机标定- - 共形几何代数方法

国家自然科学基金

0+阅读 · 2013年12月31日

基于Ontology的藏文语料库检索关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于不饱和腈合成子的萘啶骨架定向构筑

国家自然科学基金

0+阅读 · 2012年12月31日

多波束测深声纳孔径合成机理与三维高分辨探测技术

国家自然科学基金

0+阅读 · 2012年12月31日

风格化人体运动合成新方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于语言理解的机器翻译方法研究

国家自然科学基金

2+阅读 · 2009年12月31日

基于边缘点的折反射图像立体匹配与三维重建研究

国家自然科学基金

0+阅读 · 2009年12月31日

应用于面向问题的自动文摘任务的篇章分析关键技术研究

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员