Sim-to-real transfer is a powerful paradigm for robotic reinforcement learning. Training policies in simulation enables safe exploration and large-scale data collection, quickly and at low cost. However, prior work on sim-to-real transfer of robotic policies typically does not involve human-robot interaction, because accurately simulating human behavior is an open problem. In this work, our goal is to leverage the power of simulation to train robotic policies that are proficient at interacting with humans upon deployment. This poses a chicken-and-egg problem: how do we gather examples of a human interacting with a physical robot, so as to model human behavior in simulation, without already having a robot that is able to interact with a human? Our proposed method, Iterative-Sim-to-Real (i-S2R), attempts to address this. i-S2R bootstraps from a simple model of human behavior and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined. We evaluate our method on a real-world robotic table tennis task, where the objective for the robot is to play cooperatively with a human player for as long as possible. Table tennis is a high-speed, dynamic task that requires the two players to react quickly to each other's moves, making it a challenging test bed for research on human-robot interaction. We present results on an industrial robotic arm that is able to play table tennis cooperatively with human players, achieving rallies of 22 successive hits on average and 150 at best. Further, for 80% of players, rally lengths are 70% to 175% longer than with the sim-to-real (S2R) baseline. For videos of our system in action, please see https://sites.google.com/view/is2r.
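The alternation described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: the 1-D "ball position" world, the coupling between the robot's policy and the human's returns, and every name in it (`HumanModel`, `train_in_sim`, `deploy_in_real`, `refine_human_model`, `run_i_s2r`) are assumptions made purely for illustration. The sketch only shows the loop structure: train against the current human model, deploy to observe real behavior, refine the model, repeat.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class HumanModel:
    """Hypothetical human behavior model: mean 1-D position of returned balls."""
    center: float

    def sample(self, rng: random.Random) -> float:
        return rng.gauss(self.center, 0.1)


def train_in_sim(policy: float, model: HumanModel, rng: random.Random,
                 steps: int = 200, lr: float = 0.05) -> float:
    """Toy 'training': nudge the paddle position toward simulated returns."""
    for _ in range(steps):
        ball = model.sample(rng)
        policy += lr * (ball - policy)
    return policy


def deploy_in_real(policy: float, rng: random.Random, n: int = 50) -> list[float]:
    """Toy 'real world' (assumption): the human's returns shift with the
    robot's behavior, so they cannot be modeled before a robot exists."""
    return [rng.gauss(1.0 + 0.5 * policy, 0.1) for _ in range(n)]


def refine_human_model(model: HumanModel, observations: list[float],
                       blend: float = 0.5) -> HumanModel:
    """Move the model partway toward the newly observed behavior."""
    new_center = (1 - blend) * model.center + blend * mean(observations)
    return HumanModel(center=new_center)


def run_i_s2r(initial_model: HumanModel, iterations: int = 5, seed: int = 0):
    """Alternate simulated training and real deployment, refining both
    the policy and the human model in each iteration."""
    rng = random.Random(seed)
    model, policy = initial_model, 0.0
    gaps = []  # per-iteration gap between modeled and observed human behavior
    for _ in range(iterations):
        policy = train_in_sim(policy, model, rng)
        observations = deploy_in_real(policy, rng)
        gaps.append(abs(mean(observations) - model.center))
        model = refine_human_model(model, observations)
    return model, policy, gaps
```

In this toy setting the gap between the modeled and the observed human behavior shrinks from one iteration to the next, which is the intuition behind alternating the two phases rather than fitting the human model once up front.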