Reinforcement Learning (RL) has seen many recent successes for quadruped robot control. The imitation of reference motions provides a simple and powerful prior for guiding policies towards desired behaviours without the need for meticulous reward design. While much work uses motion capture data or hand-crafted trajectories as the reference motion, relatively little work has explored the use of reference motions produced by model-based trajectory optimization. In this work, we investigate several design considerations that arise with such a framework, demonstrated through four dynamic behaviours: trot, front hop, 180° backflip, and biped stepping. These are trained in simulation and transferred to a physical Solo 8 quadruped robot without further adaptation. In particular, we explore the space of feed-forward designs afforded by the trajectory optimizer to understand their impact on RL learning efficiency and sim-to-real transfer. These findings contribute to the long-standing goal of producing robot controllers that combine the interpretability and precision of model-based optimization with the robustness offered by model-free RL-based controllers.
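For concreteness, one common way to realize an imitation prior of this kind is an exponentiated tracking term that rewards staying close to the reference motion; the sketch below is illustrative only, with the symbols $q_t$ (measured joint positions), $\hat{q}_t$ (reference joint positions, here from the trajectory optimizer), and the weight $w_q$ assumed rather than taken from this work:

$$
r_t^{\mathrm{imit}} = \exp\!\left(-\, w_q \,\lVert q_t - \hat{q}_t \rVert_2^{2}\right)
$$

In such formulations the reward approaches 1 when the robot tracks the reference closely and decays smoothly with tracking error, which is what allows the reference motion to shape learning without hand-designed task rewards.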