Controlling a non-statically stable bipedal robot is challenging due to its complex dynamics and the multi-criterion optimization involved. Recent work has demonstrated the effectiveness of deep reinforcement learning (DRL) on both simulated and physical robots. In these methods, the rewards from the different criteria are typically summed so that a single value function is learned. However, this can discard the dependency information among the hybrid rewards and lead to a sub-optimal policy. In this work, we propose a novel reward-adaptive reinforcement learning method for biped locomotion that allows the control policy to be optimized by multiple criteria simultaneously through a dynamic mechanism. The proposed method uses a multi-head critic to learn a separate value function for each reward component, which yields a hybrid policy gradient. We further propose dynamic weights, allowing each component to optimize the policy with a different priority. This hybrid and dynamic policy gradient (HDPG) design enables the agent to learn more efficiently. We show that the proposed method outperforms summed-reward approaches and is able to transfer to physical robots. Sim-to-real and MuJoCo results further demonstrate the effectiveness and generalization of HDPG.
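To make the idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a multi-head critic that learns one value function per reward component, together with a policy-gradient loss formed as a weighted sum of per-component advantages. All names (e.g. MultiHeadCritic, hybrid_policy_loss) and architectural details such as layer sizes are assumptions made for this example only.

```python
# Sketch of the described idea, assuming a PyTorch-style setup; this is an
# illustration, not the paper's actual code.
import torch
import torch.nn as nn


class MultiHeadCritic(nn.Module):
    """Shared trunk with one value head per reward component."""

    def __init__(self, obs_dim: int, num_rewards: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # One scalar value estimate per reward component.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(num_rewards)]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.trunk(obs)
        # Output shape: (batch, num_rewards), one value per reward component.
        return torch.cat([head(h) for head in self.heads], dim=-1)


def hybrid_policy_loss(log_probs: torch.Tensor,
                       advantages: torch.Tensor,
                       weights: torch.Tensor) -> torch.Tensor:
    """Combine per-component advantages with (possibly dynamic) weights.

    log_probs:  (batch,)              log pi(a|s) of the sampled actions
    advantages: (batch, num_rewards)  one advantage estimate per component
    weights:    (num_rewards,)        priority assigned to each component
    """
    weighted_adv = (advantages * weights).sum(dim=-1)
    # Standard policy-gradient objective on the weighted hybrid advantage.
    return -(log_probs * weighted_adv.detach()).mean()
```

In this sketch, the weights vector plays the role of the dynamic weighting: it can be recomputed at each update (for example from the relative magnitude or learning progress of each reward component) rather than being fixed, so that each criterion influences the policy gradient with its own priority.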