Multimodal Large Language Models (MLLMs) show promise in reasoning, yet their visual perception remains a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology-inspired subtasks in 2D and 3D, with controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increasing task complexity (e.g., from 12% to 45% on the visual form constancy subtask). Further analysis of the root causes suggests that failures stem from challenges such as misallocated visual attention and unstable internal representations of fine-grained details, especially at or below the encoder's patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code, and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.