Transferring reinforcement learning policies trained in physics simulation to real hardware remains a challenge, known as the "sim-to-real" gap. Domain randomization is a simple yet effective technique for addressing dynamics discrepancies between source and target domains, but its success generally depends on heuristics and trial-and-error. In this work we investigate the impact of randomized parameter selection on policy transferability across different types of domain discrepancies. Contrary to the common practice of carefully measuring kinematic parameters while randomizing dynamics parameters, we find that virtually randomizing kinematic parameters (e.g., link lengths) during training in simulation generally outperforms dynamics randomization. Based on this finding, we introduce a new domain adaptation algorithm that exploits simulated kinematic parameter variation. Our algorithm, Multi-Policy Bayesian Optimization, trains an ensemble of universal policies conditioned on virtual kinematic parameters and efficiently adapts to the target environment using a limited number of target-domain rollouts. We showcase our findings on a simulated quadruped robot in five different target environments, each covering a different aspect of domain discrepancy.
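To make the adaptation step concrete, the following is a minimal sketch (not the authors' implementation) of a Bayesian-optimization loop in the spirit described above: a small budget of target-domain rollouts is used to search over policies conditioned on virtual kinematic parameters, with a Gaussian-process surrogate and a UCB acquisition. All names here (`rollout_return`, `PARAM_BOUNDS`, `N_ROLLOUTS`, the per-link scale bounds) are hypothetical placeholders, not details taken from the paper.

```python
# Sketch of multi-policy Bayesian optimization for sim-to-real adaptation.
# Assumptions: each policy in the ensemble is indexed by an integer id and is
# conditioned on a vector of virtual kinematic parameters (e.g. link-length scales).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

PARAM_BOUNDS = np.array([[0.8, 1.2]] * 4)   # hypothetical per-link length scale range
N_ROLLOUTS = 20                              # target-domain rollout budget

def rollout_return(policy_id, kin_params):
    """Placeholder: run one rollout of policy `policy_id`, conditioned on
    `kin_params`, in the target environment and return the episode return."""
    raise NotImplementedError

def sample_params(n):
    low, high = PARAM_BOUNDS[:, 0], PARAM_BOUNDS[:, 1]
    return np.random.uniform(low, high, size=(n, len(PARAM_BOUNDS)))

def adapt(policy_ids, n_rollouts=N_ROLLOUTS):
    gp = GaussianProcessRegressor(normalize_y=True)
    X, y = [], []                            # evaluated (policy id, kin params) -> return
    for t in range(n_rollouts):
        cand_params = sample_params(256)
        # Encode the policy index as an extra input dimension of the surrogate,
        # so the GP models returns jointly over policies and kinematic parameters.
        cand = np.array([[pid, *p] for pid in policy_ids for p in cand_params])
        if X:
            mu, sigma = gp.predict(cand, return_std=True)
            pick = cand[int(np.argmax(mu + 1.0 * sigma))]   # UCB acquisition
        else:
            pick = cand[np.random.randint(len(cand))]        # first rollout: random
        ret = rollout_return(int(pick[0]), pick[1:])
        X.append(pick)
        y.append(ret)
        gp.fit(np.array(X), np.array(y))
    best = int(np.argmax(y))
    return X[best]                           # best (policy, kinematic params) found
```

Under these assumptions, the returned pair would select which member of the policy ensemble to deploy and which virtual kinematic conditioning best compensates for the unmodeled target-domain discrepancy.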