Deep reinforcement learning has emerged as a popular and powerful way to develop locomotion controllers for quadruped robots. Common approaches have largely focused on learning actions directly in joint space, or on learning to modify and offset foot positions produced by trajectory generators. Both approaches typically require careful reward shaping and training for millions of time steps, and trajectory generators additionally introduce human bias into the resulting control policies. In this paper, we present a learning framework that leads to the natural emergence of fast and robust bounding policies for quadruped robots. The agent selects and controls actions directly in task space to track desired velocity commands, subject to environmental disturbances including model uncertainty and rough terrain. We observe that this framework improves sample efficiency, requires little reward shaping, leads to the emergence of natural gaits such as galloping and bounding, and eases sim-to-real transfer at running speeds. Policies can be learned in only a few million time steps, even for the challenging task of running over rough terrain while carrying loads of over 100% of the nominal quadruped mass. Training occurs in PyBullet, and we perform sim-to-sim transfer to Gazebo and sim-to-real transfer to Unitree A1 hardware. For sim-to-sim, our results show the quadruped is able to run at over 4 m/s without a load, and at 3.5 m/s with a 10 kg load, which is over 83% of the nominal quadruped mass. For sim-to-real, the Unitree A1 is able to bound at 2 m/s with a 5 kg load, representing 42% of the nominal quadruped mass.
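To make the task-space action interface concrete, the following is a minimal Python sketch of one common realization of such a controller: the policy commands foot positions in the hip frame, and analytic inverse kinematics plus a joint-space PD loop convert them to motor torques. The link lengths, gains, planar (sagittal-plane) simplification, and function names are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: one plausible task-space action interface for a
# quadruped leg. The policy outputs a desired foot position in the hip frame;
# inverse kinematics and a joint PD loop turn it into motor torques.
import numpy as np

THIGH, CALF = 0.20, 0.20      # assumed link lengths [m]
KP, KD = 55.0, 0.8            # assumed joint PD gains

def leg_ik_2d(x, z):
    """Planar two-link IK. (x, z): foot position in the hip frame,
    x forward, z positive downward. Returns (hip, knee) angles in the
    knee-back configuration."""
    r2 = x * x + z * z
    c_knee = (r2 - THIGH**2 - CALF**2) / (2.0 * THIGH * CALF)
    knee = -np.arccos(np.clip(c_knee, -1.0, 1.0))
    hip = np.arctan2(x, z) - np.arctan2(CALF * np.sin(knee),
                                        THIGH + CALF * np.cos(knee))
    return np.array([hip, knee])

def pd_torque(q_des, q, qd):
    """Joint-space PD law tracking the IK targets with zero desired velocity."""
    return KP * (q_des - q) - KD * qd

# Example: the policy action is an offset around a nominal foothold under the hip.
nominal_foot = np.array([0.0, 0.28])        # [x, z] in the hip frame
action = np.array([0.05, -0.03])            # task-space action from the policy
q_des = leg_ik_2d(*(nominal_foot + action))
q, qd = np.array([0.1, -1.9]), np.zeros(2)  # measured joint state (stub values)
print(pd_torque(q_des, q, qd))              # torques sent to the two leg motors
```

In this kind of interface the learned policy only reasons about where the feet should go, while low-level tracking is handled by fixed IK and PD control, which is one way task-space actions can reduce the burden on reward shaping compared with raw joint-space actions.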