Data scarcity remains a major challenge in robotic manipulation. Although diffusion models offer a promising route to generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently suffer from 3D spatial ambiguity. In this work, we present ManipDreamer3D, a novel framework for generating plausible 3D-aware robotic manipulation videos from an input image and a text instruction. Our method combines 3D trajectory planning over a 3D occupancy map reconstructed from the third-person input view with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory that minimizes path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory; these sequences condition our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned, plausible 3D trajectories, substantially reducing the need for human intervention. Experimental results demonstrate superior visual quality compared to existing methods.
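The abstract describes computing an optimized 3D end-effector trajectory that minimizes path length while avoiding collisions against the reconstructed occupancy map. As a rough illustration of that idea only (the paper does not specify its planner), the sketch below runs A* search over a boolean voxel occupancy grid; the function name `plan_3d_path` and the example workspace are hypothetical.

```python
# Hedged sketch: one possible collision-aware shortest-path planner over a
# voxel occupancy grid, NOT the paper's actual trajectory optimizer.
import heapq
import itertools
import numpy as np

def plan_3d_path(occupancy, start, goal):
    """A* over a boolean 3D occupancy grid (True = occupied voxel).

    Returns a list of voxel indices from start to goal, or None if no
    collision-free path exists.
    """
    # 26-connected neighborhood in 3D.
    moves = [(dx, dy, dz)
             for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
             if (dx, dy, dz) != (0, 0, 0)]

    def heuristic(a, b):
        # Euclidean distance lower-bounds the remaining path length.
        return float(np.linalg.norm(np.subtract(a, b)))

    tie = itertools.count()  # tiebreaker so heap never compares nodes/parents
    open_set = [(heuristic(start, goal), next(tie), 0.0, start, None)]
    came_from, best_cost = {}, {start: 0.0}

    while open_set:
        _, _, cost, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue  # already expanded with a better or equal cost
        came_from[node] = parent
        if node == goal:
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for d in moves:
            nxt = tuple(np.add(node, d))
            if any(c < 0 or c >= s for c, s in zip(nxt, occupancy.shape)):
                continue  # outside the reconstructed workspace
            if occupancy[nxt]:
                continue  # occupied voxel: skip to avoid collisions
            new_cost = cost + float(np.linalg.norm(d))
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    open_set,
                    (new_cost + heuristic(nxt, goal), next(tie), new_cost, nxt, node),
                )
    return None

# Hypothetical example: a 32^3 workspace with a central obstacle block.
occ = np.zeros((32, 32, 32), dtype=bool)
occ[12:20, 12:20, 0:20] = True
path = plan_3d_path(occ, start=(2, 2, 2), goal=(30, 30, 10))
print(len(path) if path else "no collision-free path")
```

Under these assumptions, the resulting waypoint sequence would play the role of the planned 3D end-effector trajectory that conditions the trajectory-to-video diffusion model.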