Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it remains unclear how current models can be used on a robot directly. In this paper, we aim to bridge this gap by leveraging videos of human interactions in an environment-centric manner. Utilizing internet videos of human behavior, we train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show how to seamlessly integrate our affordance model with four robot learning paradigms, including offline imitation learning, exploration, goal-conditioned learning, and action parameterization for reinforcement learning. We demonstrate the efficacy of our approach, which we call VRB, across 4 real-world environments, over 10 different tasks, and 2 robotic platforms operating in the wild. Results, visualizations, and videos at https://robo-affordances.github.io/
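For concreteness, the sketch below illustrates one way the affordance interface described in the abstract could be exposed to downstream robot learning: a model that maps a scene image to a contact-point heatmap ("where" to interact) and a post-contact motion direction ("how" to interact), from which contact points can be sampled. This is a minimal illustrative sketch, not the paper's released implementation; the names `AffordanceModel`, `predict`, and `sample_contact_point` are hypothetical placeholders, and the prediction step is stubbed out where a learned network would run.

```python
import numpy as np


class AffordanceModel:
    """Hypothetical interface for a visual affordance model (sketch only).

    Given a scene image, it returns (i) a per-pixel contact heatmap and
    (ii) a 2D post-contact motion direction in image coordinates.
    """

    def __init__(self, height: int = 224, width: int = 224):
        self.height = height
        self.width = width

    def predict(self, image: np.ndarray):
        """Return (contact_heatmap, post_contact_direction) for one image."""
        assert image.shape[:2] == (self.height, self.width)
        # Placeholder: a real model would run a learned vision network here.
        heatmap = np.full((self.height, self.width),
                          1.0 / (self.height * self.width))
        direction = np.array([1.0, 0.0])  # unit vector in image coordinates
        return heatmap, direction

    def sample_contact_point(self, heatmap: np.ndarray, rng=None):
        """Sample a pixel (row, col) with probability proportional to the heatmap."""
        rng = rng or np.random.default_rng()
        probs = heatmap.ravel() / heatmap.sum()
        idx = rng.choice(probs.size, p=probs)
        return np.unravel_index(idx, heatmap.shape)


if __name__ == "__main__":
    model = AffordanceModel()
    image = np.zeros((224, 224, 3), dtype=np.uint8)
    heatmap, direction = model.predict(image)
    print(model.sample_contact_point(heatmap), direction)
```

In a pipeline like the one the abstract describes, such outputs could seed exploration, parameterize actions for reinforcement learning, or serve as goals/demonstration priors; the exact coupling to each paradigm is detailed in the paper itself.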