Deep reinforcement learning is an effective tool for learning robot control policies from scratch. However, these methods are notorious for the enormous amount of training data they require, which is prohibitively expensive to collect on real robots. A highly popular alternative is to learn from simulations, which allows data to be generated much faster, safer, and cheaper. Since every simulator is merely a model of reality, there are inevitable differences between simulated and real data, often referred to as the 'reality gap'. To bridge this gap, many approaches learn a single policy from a distribution over simulators. In this paper, we propose to combine reinforcement learning from randomized physics simulations with policy distillation. Our algorithm, called Distilled Domain Randomization (DiDoR), distills so-called teacher policies, each an expert on one of the initially sampled domains, into a student policy that is later deployed. In this way, DiDoR learns controllers that transfer directly from simulation to reality, i.e., without requiring data from the target domain. We compare DiDoR against three baselines in three sim-to-sim as well as two sim-to-real experiments. Our results show that the target-domain performance of policies trained with DiDoR is on par with or better than that of the baselines. Moreover, our approach increases neither the required memory capacity nor the time to compute an action, both of which could otherwise become a point of failure when deploying the learned controller.
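To make the teacher-student setup concrete, the following is a minimal sketch of the distillation idea summarized above, not the authors' implementation. Everything specific here is an assumption for illustration: linear policies, a least-squares behavioral-cloning objective, and a toy one-dimensional domain parameterized only by a mass drawn from the randomization distribution.

```python
# Hypothetical sketch of distilling per-domain teacher policies into one student.
# Assumptions (not from the paper): linear policies a = w * s, a single randomized
# physics parameter (mass), and least-squares regression as the distillation loss.
import numpy as np

rng = np.random.default_rng(0)

def sample_domain():
    """Draw one set of physics parameters from the randomization distribution."""
    return {"mass": rng.uniform(0.5, 2.0)}

def train_teacher(domain):
    """Placeholder for RL training on one sampled domain.
    Returns a toy 'expert' linear policy tuned to that domain's mass."""
    return {"w": -1.0 / domain["mass"]}

def rollout_states(n=200):
    """States on which the teachers are queried during distillation."""
    return rng.normal(size=(n, 1))

# 1) Sample domains once and train one teacher per domain.
domains = [sample_domain() for _ in range(10)]
teachers = [train_teacher(d) for d in domains]

# 2) Distill all teachers into a single student by regressing the student's
#    actions onto the teachers' actions over the collected states.
X, Y = [], []
for teacher in teachers:
    states = rollout_states()
    X.append(states)
    Y.append(teacher["w"] * states)  # teacher actions on those states
X, Y = np.vstack(X), np.vstack(Y)
student_w, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The single student policy is what would be deployed, without target-domain data.
print("student gain:", student_w.ravel()[0])
```

Because only the distilled student is kept at deployment time, the memory footprint and per-step inference cost match those of a single policy, which is the property the abstract highlights.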