Robots operating in human environments must be able to rearrange objects into semantically meaningful configurations, even when those objects are previously unseen. In this work, we focus on the problem of building physically valid structures without step-by-step instructions. We propose StructDiffusion, which combines a diffusion model with an object-centric transformer to construct structures from a single RGB-D image given high-level language goals, such as "set the table." Our method shows how diffusion models can be used for complex multi-step 3D planning tasks. StructDiffusion improves the success rate of assembling physically valid structures from unseen objects by 16% on average over an existing multi-modal transformer model, while using a single multi-task model to produce a wider range of structures. We present experiments on held-out objects in both simulated and real-world rearrangement tasks. For videos and additional results, see our website: http://weiyuliu.com/StructDiffusion/.
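To make the described architecture concrete, below is a minimal PyTorch sketch of how such a conditional diffusion denoiser might be wired up: a transformer that attends over per-object tokens (noisy placement poses plus shape features) together with a language-goal token, and predicts the noise on each object's pose. All module names, dimensions, and the pose parameterization are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of an object-centric transformer denoiser for pose
# diffusion. Everything here (dimensions, 9-D pose parameterization,
# feature encoders) is an assumption for illustration only.
import torch
import torch.nn as nn

class PoseDenoiser(nn.Module):
    def __init__(self, d_model=256, pose_dim=9):
        super().__init__()
        # pose_dim=9: xyz translation + a 6D rotation parameterization (assumed)
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_poses, shape_feats, lang_feat, t):
        # noisy_poses: (B, N, pose_dim) per-object poses at diffusion step t
        # shape_feats: (B, N, d_model)  per-object point-cloud embeddings
        # lang_feat:   (B, 1, d_model)  embedded goal, e.g. "set the table"
        tokens = self.pose_proj(noisy_poses) + shape_feats
        tokens = tokens + self.time_embed(t.view(-1, 1, 1).float())
        tokens = torch.cat([lang_feat, tokens], dim=1)  # prepend goal token
        h = self.encoder(tokens)
        return self.out(h[:, 1:])  # predicted noise for each object pose

# Example usage (shapes only; the shape and language features would come
# from point-cloud and language encoders, both assumed here):
model = PoseDenoiser()
B, N = 2, 5
eps_pred = model(torch.randn(B, N, 9), torch.randn(B, N, 256),
                 torch.randn(B, 1, 256), torch.randint(0, 1000, (B,)))
```

At inference time, placements would be sampled by iteratively denoising per-object poses from Gaussian noise; the transformer's cross-object attention is what would let the sampled poses be jointly consistent, which is the property the abstract ties to physical validity.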