Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, as references. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations; Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of the factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.