Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.

