We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
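To make the prompt design concrete, the following Python sketch illustrates one possible way dense visual signals could be textualized for a language model: the image is referenced by its file name, and a vision expert's detections are rendered as plain text with spatial coordinates. The function name, file path, and detection results here are hypothetical and for illustration only; they are not the authors' implementation.

```python
# A minimal sketch (assumptions, not the authors' code): textualize an image
# reference and a hypothetical detection expert's output so a text-only
# language model can accept, associate, and reason over them.

from typing import List, Tuple

def textualize_detections(image_path: str,
                          detections: List[Tuple[str, Tuple[int, int, int, int]]]) -> str:
    """Render object detections as plain text for the language model.

    Each detection is (label, (x1, y1, x2, y2)) in pixel coordinates.
    """
    lines = [f"Image file: {image_path}"]
    for label, (x1, y1, x2, y2) in detections:
        lines.append(f"{label} at ({x1}, {y1}, {x2}, {y2})")
    return "\n".join(lines)

# Hypothetical expert output for illustration only.
expert_output = [("person", (34, 50, 210, 400)), ("dog", (220, 180, 380, 390))]

prompt = (
    "You can reason over images that are described as text.\n"
    + textualize_detections("images/park.jpg", expert_output)
    + "\nQuestion: What is to the right of the person?"
)
print(prompt)
```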