MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yolo Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu

From arXiv. Accepted to the NeurIPS 2025 Datasets and Benchmarks (DB) Track.

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and with maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns among model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at: https://yunlong10.github.io/MMPerspective/
