Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.