MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
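To make the representation described above more concrete, the following is a minimal illustrative sketch of a state-aware scene graph that holds both spatial and functional edges, plus a toy planner that reads the graph before emitting steps. It is not the MomaGraph schema or the MomaGraph-R1 planner: the class names (`Node`, `Edge`, `SceneGraph`), the relation labels (`inside`, `opens`), and the `plan_fetch` helper are all hypothetical, chosen only to illustrate the idea of unified relations, object states, and part-level elements.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical object node carrying a mutable state and actionable parts."""
    name: str
    state: str = "unknown"                           # e.g. "open", "closed"
    parts: List[str] = field(default_factory=list)   # e.g. ["handle"]


@dataclass
class Edge:
    """Hypothetical relation; `kind` is either "spatial" or "functional"."""
    source: str
    target: str
    relation: str
    kind: str


@dataclass
class SceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def add_edge(self, edge: Edge) -> None:
        self.edges.append(edge)

    def update_state(self, name: str, state: str) -> None:
        # State updates let the graph track changes over time.
        self.nodes[name].state = state


def plan_fetch(graph: SceneGraph, obj: str) -> List[str]:
    """Toy graph-first planner (an assumption, not the paper's method):
    open any closed container that holds `obj`, then pick `obj` up."""
    steps: List[str] = []
    for e in graph.edges:
        if e.source == obj and e.kind == "spatial" and e.relation == "inside":
            container = graph.nodes[e.target]
            steps.append(f"navigate to {container.name}")
            if container.state == "closed":
                # Use a functional edge to find the part that opens the container.
                openers = [
                    f.source for f in graph.edges
                    if f.target == container.name
                    and f.kind == "functional"
                    and f.relation == "opens"
                ]
                part = openers[0] if openers else container.name
                steps.append(f"actuate {part} to open {container.name}")
                graph.update_state(container.name, "open")
    steps.append(f"pick up {obj}")
    return steps


graph = SceneGraph()
graph.add_node(Node("fridge", state="closed", parts=["handle"]))
graph.add_node(Node("milk"))
graph.add_edge(Edge("milk", "fridge", "inside", "spatial"))
graph.add_edge(Edge("handle", "fridge", "opens", "functional"))

first_plan = plan_fetch(graph, "milk")
second_plan = plan_fetch(graph, "milk")  # fridge is now recorded as open
```

In this toy example the first plan includes an opening step that targets the handle, while the second plan skips it because the fridge's state was updated to `open`. This is the kind of behavior that a static snapshot without object states could not express.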
