Learning goal-conditioned control in the real world is a challenging open problem in robotics. Reinforcement learning systems have the potential to learn autonomously via trial and error, but in practice the costs of manual reward design, safe exploration, and hyperparameter tuning often preclude real-world deployment. Imitation learning approaches, on the other hand, offer a simple way to learn control in the real world, but typically require costly curated demonstration data and lack a mechanism for continuous improvement. Recently, iterative imitation techniques have been shown to learn goal-directed control from undirected demonstration data and to improve continuously via self-supervised goal reaching, but results thus far have been limited to simulated environments. In this work, we present evidence that iterative imitation learning can scale to goal-directed behavior on a real robot in a dynamic setting: high-speed, precision table tennis (e.g., "land the ball on this particular target"). We find that this approach offers a straightforward way to perform continuous on-robot learning, without complexities such as reward design or sim-to-real transfer, and is sample-efficient enough to train on a physical robot in just a few hours. In real-world evaluations, we find that the resulting policy performs on par with or better than amateur humans (players sampled randomly from a robotics lab) at the task of returning the ball to specific targets on the table. Finally, we analyze the effect of the initial undirected bootstrap dataset size on performance, finding that a modest amount of unstructured demonstration data provided up front drastically speeds up the convergence of a general-purpose goal-reaching policy. See https://sites.google.com/view/goals-eye for videos.
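The iterative imitation scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `policy` and `env` interfaces (`fit`, `sample_goal`, `rollout`) are hypothetical names, and the relabeling step assumes a GCSL-style loop in which every outcome actually achieved in a trajectory is treated, in hindsight, as the goal that was commanded.

```python
def relabel(trajectory):
    """Hindsight relabeling: turn an undirected trajectory of
    (state, action) pairs into goal-directed (state, goal, action)
    training examples, using each later outcome as a pseudo-goal."""
    examples = []
    for t, (state, action) in enumerate(trajectory):
        for future_state, _ in trajectory[t + 1:]:
            examples.append((state, future_state, action))
    return examples


def iterative_imitation(policy, env, bootstrap_trajs, rounds, episodes_per_round):
    """Sketch of self-supervised goal reaching (hypothetical interfaces):
    clone relabeled data, collect fresh goal-conditioned attempts with
    the current policy, relabel those too, and repeat."""
    dataset = [ex for traj in bootstrap_trajs for ex in relabel(traj)]
    for _ in range(rounds):
        policy.fit(dataset)  # supervised behavior cloning on (s, g) -> a
        for _ in range(episodes_per_round):
            goal = env.sample_goal()
            trajectory = env.rollout(policy, goal)
            dataset.extend(relabel(trajectory))  # no reward signal needed
    return policy
```

No reward function appears anywhere in the loop, which is why this style of training sidesteps reward design: supervision comes entirely from the outcomes the robot actually achieves.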