Can we learn robot manipulation for everyday tasks only by watching videos of humans performing arbitrary tasks in unstructured settings? Unlike widely adopted strategies that learn task-specific behaviors or directly imitate a human video, we develop a framework for extracting agent-agnostic action representations from human videos and then mapping them to the agent's embodiment at deployment. Our framework is based on predicting plausible human hand trajectories given an initial image of a scene. After training this prediction model on a diverse set of human videos from the internet, we deploy it zero-shot for physical robot manipulation tasks, applying appropriate transformations to the robot's embodiment. This simple strategy lets us solve coarse manipulation tasks such as opening and closing drawers, pushing, and tool use, without access to any in-domain robot manipulation trajectories. Our real-world deployment results establish a strong baseline for the action-prediction information that can be acquired from diverse, arbitrary videos of human activity and used for zero-shot robotic manipulation in unseen scenes.
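To make the described pipeline concrete, the following is a minimal sketch of the deployment flow: predict future hand waypoints from an initial scene image, then retarget them to robot end-effector poses. All names (`predict_hand_trajectory`, `retarget_to_robot`, the straight-line stub, and the identity camera-to-base calibration) are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def predict_hand_trajectory(image: np.ndarray, horizon: int = 10) -> np.ndarray:
    """Placeholder for the learned hand-trajectory predictor.

    Given an initial RGB image of the scene, return `horizon` future hand
    waypoints (x, y, z) in the camera frame. Here we return a straight-line
    stub so the sketch runs end to end; the real model is learned from
    internet videos of human activity.
    """
    start = np.array([0.0, 0.0, 0.5])
    goal = np.array([0.1, 0.0, 0.3])
    alphas = np.linspace(0.0, 1.0, horizon)[:, None]
    return start + alphas * (goal - start)

def retarget_to_robot(hand_waypoints: np.ndarray,
                      cam_to_base: np.ndarray) -> np.ndarray:
    """Map predicted hand waypoints (camera frame) to robot end-effector
    targets (base frame) using a known 4x4 camera-to-base transform."""
    homogeneous = np.concatenate(
        [hand_waypoints, np.ones((len(hand_waypoints), 1))], axis=1)
    return (cam_to_base @ homogeneous.T).T[:, :3]

if __name__ == "__main__":
    image = np.zeros((224, 224, 3), dtype=np.uint8)  # initial scene image
    cam_to_base = np.eye(4)                          # assumed calibration
    waypoints = predict_hand_trajectory(image)       # agent-agnostic prediction
    ee_targets = retarget_to_robot(waypoints, cam_to_base)  # embodiment mapping
    print(ee_targets.shape)  # (10, 3) end-effector targets to execute open-loop
```

The split between the two functions mirrors the abstract's separation of an agent-agnostic prediction stage (trained only on human videos) from a purely geometric embodiment-transfer stage applied at deployment time.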