Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effectively using LLMs for zero-shot visual question answering (VQA) remains challenging, primarily due to the modality disconnect and task disconnect between the LLM and the VQA task. End-to-end training on vision and language data may bridge these disconnects, but it is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides prompts bridging the aforementioned modality and task disconnects, so that LLMs can perform zero-shot VQA without end-to-end training. To construct such prompts, we employ LLM-agnostic models to generate descriptions of image content and self-constructed question-answer pairs, which effectively guide the LLM to perform zero-shot VQA. Img2Prompt offers the following benefits: 1) It works flexibly with various LLMs to perform VQA. 2)~Because it requires no end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA. 3) It achieves performance comparable to or better than methods relying on end-to-end training. For example, we outperform Flamingo~\cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
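To make the idea concrete, the following is a minimal sketch (in Python, with hypothetical helper names and placeholder model outputs, not our released implementation) of how image captions and self-constructed question-answer pairs could be composed into a single textual prompt that a frozen LLM completes to answer the question.

\begin{verbatim}
# Conceptual sketch only: the captions and exemplar QA pairs would come
# from LLM-agnostic vision models (e.g., a captioner and a question
# generator); the frozen LLM only ever sees the resulting prompt string.

def build_vqa_prompt(captions, exemplar_qa_pairs, question):
    """Compose captions and synthetic QA exemplars into one text prompt."""
    lines = ["Contexts: " + " ".join(captions)]
    for q, a in exemplar_qa_pairs:  # in-context examples derived from the image
        lines.append("Question: {} Answer: {}".format(q, a))
    lines.append("Question: {} Answer:".format(question))
    return "\n".join(lines)

# Example usage with placeholder model outputs.
captions = ["A man in a red jersey swings a bat on a baseball field."]
exemplars = [("What sport is being played?", "baseball"),
             ("What color is the jersey?", "red")]
prompt = build_vqa_prompt(captions, exemplars, "What is the man holding?")
print(prompt)  # this string is passed to any frozen LLM to obtain the answer
\end{verbatim}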