Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
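To make the approach concrete, below is a minimal sketch of the kind of Python program ViperGPT might generate for a counting-style query. The `ImagePatch` class and its `find` method are simplified stand-ins introduced here for illustration; the actual API provided to the code-generation model is richer, wrapping pretrained modules for detection, visual question answering, and more.

```python
from typing import List


class ImagePatch:
    """Simplified stand-in for the provided API (assumption: the real
    ViperGPT API wraps pretrained vision-and-language modules)."""

    def __init__(self, image):
        self.image = image

    def find(self, object_name: str) -> List["ImagePatch"]:
        # In ViperGPT this would dispatch to a pretrained open-vocabulary
        # object detector; left as a stub in this sketch.
        raise NotImplementedError


# The kind of code the framework could generate for a query such as
# "How many muffins can each kid have for it to be fair?": the
# code-generation model composes calls to the API, while ordinary
# Python handles the reasoning step (here, integer division).
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    muffin_patches = image_patch.find("muffin")
    kid_patches = image_patch.find("kid")
    return str(len(muffin_patches) // len(kid_patches))
```

Because the generated program is plain Python, each intermediate step (which modules were called, with which arguments, and what they returned) can be inspected directly, which is the source of the interpretability the abstract claims.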