We aim to teach robots to perform simple object manipulation tasks by watching a single video demonstration. Towards this goal, we propose an optimization approach that outputs a coarse and temporally evolving 3D scene to mimic the action demonstrated in the input video. Similar to previous work, a differentiable renderer ensures perceptual fidelity between the 3D scene and the 2D video. Our key novelty lies in the inclusion of a differentiable approach to solve a set of Ordinary Differential Equations (ODEs) that allows us to approximately model laws of physics such as gravity, friction, and hand-object or object-object interactions. This not only enables us to dramatically improve the quality of estimated hand and object states, but also produces physically admissible trajectories that can be directly translated to a robot without the need for costly reinforcement learning. We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as "pull something from right to left" or "put something in front of something". Our approach improves over the previous state of the art by almost 30%, demonstrating superior quality on especially challenging actions involving physical interactions of two objects, such as "put something onto something". Finally, we showcase the learned skills on a Franka Emika Panda robot.
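To make the core idea concrete, the sketch below illustrates (in PyTorch) how a physics term can be differentiated through jointly with an image-consistency term: a simple explicit-Euler roll-out with gravity and friction stands in for the differentiable ODE solver, and a 2D reprojection error stands in for the differentiable-rendering loss. This is a minimal illustrative example, not the authors' implementation; all names (simulate, observed_xy, forces) and the dummy observations are hypothetical.

```python
# Minimal sketch: optimize an object trajectory so that it both follows simple
# physics (gravity, friction, applied hand forces) and matches video observations.
import torch

g = torch.tensor([0.0, -9.81])   # gravity (m/s^2), acting on a 2D point object
mu = 0.3                          # assumed friction/damping coefficient
dt = 1.0 / 30.0                   # assumed video frame rate

def simulate(x0, v0, forces, steps):
    """Explicit-Euler roll-out of a point object; `forces` is a learnable (steps, 2)
    tensor standing in for hand-object interaction. Everything stays differentiable."""
    xs, x, v = [], x0, v0
    for t in range(steps):
        a = g + forces[t] - mu * v      # gravity + applied force + viscous friction
        v = v + dt * a
        x = x + dt * v
        xs.append(x)
    return torch.stack(xs)

# "Observed" 2D object positions extracted from the video (dummy data for the sketch:
# the object slides from left to right over one second).
steps = 30
observed_xy = torch.linspace(0, 1, steps).unsqueeze(1) * torch.tensor([[1.0, 0.0]])

# Learnable initial state and interaction forces.
x0 = torch.zeros(2, requires_grad=True)
v0 = torch.zeros(2, requires_grad=True)
forces = torch.zeros(steps, 2, requires_grad=True)

opt = torch.optim.Adam([x0, v0, forces], lr=1e-2)
for it in range(500):
    opt.zero_grad()
    traj = simulate(x0, v0, forces, steps)
    # In the paper this term is a differentiable-rendering loss between the 3D scene
    # and the 2D video; here a simple reprojection error plays that role, plus a small
    # penalty keeping the inferred forces physically plausible.
    loss = ((traj - observed_xy) ** 2).mean() + 1e-3 * (forces ** 2).mean()
    loss.backward()
    opt.step()
```

Because the roll-out is physically constrained, the recovered trajectory and forces are admissible by construction, which is what allows direct transfer to the robot without reinforcement learning.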