Reinforcement Learning (RL) has seen many recent successes for quadruped robot control. The imitation of reference motions provides a simple and powerful prior for guiding learned policies towards desired behaviours without the need for meticulous reward design. While much work uses motion capture data or hand-crafted trajectories as the reference motion, relatively little work has explored the use of reference motions produced by model-based trajectory optimization. In this work, we investigate several design considerations that arise with such a framework, as demonstrated through four dynamic behaviours: trot, front hop, 180° backflip, and biped stepping. These are trained in simulation and transferred to a physical Solo 8 quadruped robot without further adaptation. In particular, we explore the space of feed-forward designs afforded by the trajectory optimizer to understand their impact on RL learning efficiency and sim-to-real transfer. These findings contribute to the long-standing goal of producing robot controllers that combine the interpretability and precision of model-based optimization with the robustness that model-free RL-based controllers offer.