Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

Theory of Mind (ToM), the ability to attribute beliefs, desires, and emotions to others, is fundamental to human social intelligence, yet it remains a major challenge for artificial agents. Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning remains largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. The dataset is built through a VLM-assisted human-in-the-loop pipeline: human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured, ToM-focused scene descriptions; and these descriptions are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse ToM facets: mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
