Deep reinforcement learning suggests the promise of fully automated learning of robotic control policies that directly map sensory inputs to low-level actions. However, applying deep reinforcement learning methods on real-world robots is exceptionally difficult, due both to the sample complexity and, just as importantly, the sensitivity of such methods to hyperparameters. While hyperparameter tuning can be performed in parallel in simulated domains, it is usually impractical to tune hyperparameters directly on real-world robotic platforms, especially legged platforms like quadrupedal robots that can be damaged through extensive trial-and-error learning. In this paper, we develop a stable variant of the soft actor-critic deep reinforcement learning algorithm that requires minimal hyperparameter tuning, while also requiring only a modest number of trials to learn multilayer neural network policies. This algorithm is based on the framework of maximum entropy reinforcement learning, and automatically trades off exploration against exploitation by dynamically and automatically tuning a temperature parameter that determines the stochasticity of the policy. We show that this method achieves state-of-the-art performance on four standard benchmark environments. We then demonstrate that it can be used to learn quadrupedal locomotion gaits on a real-world Minitaur robot, learning to walk from scratch directly in the real world in two hours of training.
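To make the automatic temperature mechanism concrete, the sketch below shows one common way to realize it: the temperature (often called alpha) weights the policy-entropy bonus, and it is adjusted by gradient descent so that the policy's entropy is driven toward a fixed target. This is a minimal illustrative sketch, not the authors' released code; the names (target_entropy, log_alpha, update_temperature) and the heuristic target of minus the action dimension are assumptions chosen for the example.

```python
# Minimal sketch of automatic temperature adjustment in a maximum-entropy
# actor-critic setup. All names and hyperparameters here are illustrative.
import torch

action_dim = 8                       # e.g. number of actuated joints (assumed)
target_entropy = -float(action_dim)  # common heuristic target: -|A|

# Optimize log(alpha) so the temperature alpha = exp(log_alpha) stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_probs: torch.Tensor) -> float:
    """One gradient step on J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)].

    log_probs: log pi(a|s) for actions sampled from the current policy
               on a minibatch of states (shape: [batch]).
    """
    alpha_loss = -(log_alpha.exp() * (log_probs + target_entropy).detach()).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()    # current temperature value
```

In such a scheme, the returned temperature scales the entropy term in the actor objective and in the critic's target values, so a policy that is too deterministic (entropy below target) pushes alpha up, encouraging exploration, while an overly random policy pushes alpha down, favoring exploitation.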