We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.