We study the problem of imitating object interactions from Internet videos. This requires understanding hand-object interactions in 4D (spatially in 3D and over time), which is challenging due to mutual hand-object occlusions. In this paper, we make two main contributions: (1) a novel reconstruction technique, RHOV (Reconstructing Hands and Objects from Videos), which reconstructs 4D trajectories of both the hand and the object using 2D image cues and temporal smoothness constraints; and (2) a system for imitating object interactions in a physics simulator with reinforcement learning. We apply our reconstruction technique to 100 challenging Internet videos. We further show that we can successfully imitate a range of different object interactions in a physics simulator. Our object-centric approach is not limited to human-like end-effectors and can learn to imitate object interactions using different embodiments, such as a robotic arm with a parallel-jaw gripper.