Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, as references. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations; Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of the factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.