Deep reinforcement learning has emerged as a popular and powerful way to develop locomotion controllers for quadruped robots. Common approaches have largely focused on learning actions directly in joint space, or on learning to modify and offset foot positions produced by trajectory generators. Both approaches typically require careful reward shaping and training for millions of time steps, and trajectory generators additionally introduce human bias into the resulting control policies. In this paper, we instead explore learning foot positions in Cartesian space, which we track with impedance control, for the task of running as fast as possible subject to environmental disturbances. Compared with other action spaces, we observe reduced reward shaping requirements, significantly improved sample efficiency, the emergence of natural gaits such as galloping and bounding, and ease of sim-to-sim transfer. Policies can be learned in only a few million time steps, even for challenging tasks such as running over rough terrain with loads exceeding 100% of the nominal quadruped mass. Training occurs in PyBullet, and we perform a sim-to-sim transfer to Gazebo, where our quadruped is able to run at over 4 m/s without a load, and at 3.5 m/s with a 10 kg load, which is over 83% of the nominal quadruped mass. Video results can be found at https://youtu.be/roE1vxpEWfw.
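To make the action-space choice concrete, the sketch below shows the standard Cartesian-space impedance control law often used to convert a policy's desired foot positions into joint torques, tau = J^T [Kp (p_des - p) + Kd (v_des - v)]. This is a minimal illustration, not the authors' implementation: the gain values and the function names are assumptions for demonstration.

```python
import numpy as np

# Cartesian impedance gains for one leg; the numeric values are
# illustrative assumptions, not the gains used in the paper.
KP = np.diag([500.0, 500.0, 500.0])  # stiffness (N/m)
KD = np.diag([10.0, 10.0, 10.0])     # damping (N*s/m)

def leg_torques(J, p, v, p_des, v_des=np.zeros(3)):
    """Map a desired Cartesian foot position to joint torques for one leg.

    J     : 3x3 foot Jacobian in the leg frame (from the robot model)
    p, v  : current foot position / velocity in the leg frame
    p_des : desired foot position, e.g. output by a learned policy
    v_des : desired foot velocity (zero for pure position targets)
    """
    f = KP @ (p_des - p) + KD @ (v_des - v)  # virtual spring-damper force
    return J.T @ f                           # joint torques via J^T mapping
```

Because the policy outputs only desired foot positions, the spring-damper tracking loop runs at a higher rate than the policy, which is one reason Cartesian action spaces can ease reward shaping relative to direct joint-space actions.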