Reinforcement learning encounters many challenges when applied directly in the real world. Sim-to-real transfer is widely used to transfer knowledge learned in simulation to the real world. Domain randomization -- one of the most popular algorithms for sim-to-real transfer -- has been demonstrated to be effective on various tasks in robotics and autonomous driving. Despite its empirical successes, theoretical understanding of why this simple algorithm works is limited. In this paper, we propose a theoretical framework for sim-to-real transfer, in which the simulator is modeled as a set of Markov decision processes (MDPs) with tunable parameters (corresponding to unknown physical parameters such as friction). We provide sharp bounds on the sim-to-real gap -- the difference between the value of the policy returned by domain randomization and the value of an optimal policy for the real world. We prove that sim-to-real transfer can succeed under mild conditions without any real-world training samples. Our theory also highlights the importance of using memory (i.e., history-dependent policies) in domain randomization. Our proof is based on novel techniques that reduce the problem of bounding the sim-to-real gap to the problem of designing efficient learning algorithms for infinite-horizon MDPs, which we believe are of independent interest.
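To make the sim-to-real gap concrete, a minimal formalization is sketched below; the symbols $M^\star$, $\mu$, $\Pi_{\mathrm{hist}}$, and $\pi^{\mathrm{DR}}$ are shorthand introduced here for illustration and need not match the paper's own notation:
$$\mathrm{Gap}(\pi^{\mathrm{DR}}) \;=\; V^{\star}_{M^\star} - V^{\pi^{\mathrm{DR}}}_{M^\star}, \qquad \pi^{\mathrm{DR}} \in \operatorname*{arg\,max}_{\pi \in \Pi_{\mathrm{hist}}} \; \mathbb{E}_{M \sim \mu}\!\left[ V^{\pi}_{M} \right],$$
where $M^\star$ is the real-world MDP, $V^{\star}_{M^\star}$ is its optimal value, $\mu$ is the randomization distribution over the simulator MDPs with tunable parameters, and $\Pi_{\mathrm{hist}}$ is the class of history-dependent policies. Domain randomization returns $\pi^{\mathrm{DR}}$ without ever sampling from $M^\star$, and the claim is that this gap is small under mild conditions.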