Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model

World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, which leads to hallucinations and to plans that fail to satisfy physical constraints consistently over long horizons. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that uses a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which uses vision-language models (VLMs) to diagnose execution failures within the simulator and iteratively refines the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures. Project website: https://embodied-tree-of-thoughts.github.io

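The two branching mechanisms can be illustrated with a small sketch. This is not the authors' implementation: every name below (`etot_search`, `toy_propose`, `toy_simulate`, `toy_reflect`) is hypothetical, a hand-written state machine stands in for the physics-based digital twin, a lookup table stands in for the VLM diagnosis, and the breadth-first expansion order is an assumption, since the abstract does not specify a search strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Plan = Tuple[str, ...]


@dataclass
class SimResult:
    success: bool
    failed_step: Optional[int] = None  # index of the action that failed, if any
    reason: str = ""                   # diagnosis string passed to reflection


@dataclass
class Node:
    plan: Plan
    origin: str                        # "priori" or "reflective"
    children: List["Node"] = field(default_factory=list)


def etot_search(
    propose: Callable[[], List[Plan]],
    simulate: Callable[[Plan], SimResult],
    reflect: Callable[[Plan, SimResult], List[Plan]],
    max_expansions: int = 20,
) -> Optional[Node]:
    """Breadth-first tree search (assumed order) over candidate plans."""
    # Priori branching: seed the tree with diverse candidate plans.
    frontier = [Node(plan, "priori") for plan in propose()]
    seen = {node.plan for node in frontier}
    for _ in range(max_expansions):
        if not frontier:
            break
        node = frontier.pop(0)
        result = simulate(node.plan)   # roll the plan out in the world model
        if result.success:
            return node
        # Reflective branching: diagnose the failure, add corrected plans.
        for fixed in reflect(node.plan, result):
            if fixed in seen:
                continue
            seen.add(fixed)
            child = Node(fixed, "reflective")
            node.children.append(child)
            frontier.append(child)
    return None


# --- Toy stand-ins (hypothetical): put a mug into a closed drawer. ---
def toy_propose() -> List[Plan]:
    return [("place mug in drawer",), ("pick mug", "place mug in drawer")]


def toy_simulate(plan: Plan) -> SimResult:
    drawer_open = holding = placed = False
    for i, action in enumerate(plan):
        if action == "open drawer":
            drawer_open = True
        elif action == "pick mug":
            holding = True
        elif action == "place mug in drawer":
            if not holding:
                return SimResult(False, i, "not holding mug")
            if not drawer_open:
                return SimResult(False, i, "drawer closed")
            holding, placed = False, True
    return SimResult(True) if placed else SimResult(False, None, "goal not reached")


def toy_reflect(plan: Plan, result: SimResult) -> List[Plan]:
    # Rule table standing in for a VLM: map a diagnosis to a corrective action.
    fixes = {"drawer closed": "open drawer", "not holding mug": "pick mug"}
    if result.failed_step is None or result.reason not in fixes:
        return []
    i = result.failed_step
    return [plan[:i] + (fixes[result.reason],) + plan[i:]]


best = etot_search(toy_propose, toy_simulate, toy_reflect)
print(best.plan if best else "no feasible plan found")
```

In this toy run both priori candidates fail in simulation, and the plan that succeeds comes from a reflective branch that inserts the missing `open drawer` step before the failed placement. A real system would replace the three callables with a plan generator, a physics simulator rollout, and a VLM-based failure diagnosis.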