Machine learning algorithms have found numerous applications in robotics and control systems. The control systems community has shown growing interest in algorithms from sub-domains such as supervised learning, imitation learning, and reinforcement learning to achieve autonomous control and intelligent decision making. Among many complex control problems, stable bipedal walking remains one of the most challenging. In this paper, we present an architecture to design and simulate a planar bipedal walking robot (BWR) using a realistic robotics simulator, Gazebo. The robot demonstrates successful walking behaviour by learning through trial and error, without any prior knowledge of its own dynamics or those of the world. Autonomous walking of the BWR is achieved using a reinforcement learning algorithm called Deep Deterministic Policy Gradient (DDPG), an algorithm for learning controls in continuous action spaces. After training the model in simulation, we observed that, with a properly shaped reward function, the robot achieved faster walking and even a running gait, with an average speed of 0.83 m/s. The gait pattern of the bipedal walker was compared with an actual human walking pattern, and the results show that the two had similar characteristics.
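DDPG pairs a deterministic actor with a learned critic, trains the critic against a bootstrapped target, and stabilizes both with slowly tracking (Polyak-averaged) target networks. The following is a minimal, hypothetical sketch of one DDPG update step; the tiny linear actor and critic here stand in for the neural networks an actual implementation (including the one described above) would use, and all dimensions and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and hyperparameters (not from the paper).
state_dim, action_dim = 3, 1
gamma, tau, lr = 0.99, 0.005, 1e-2

# Actor mu(s) = W_a @ s; critic Q(s, a) = w_c . [s; a] -- linear stand-ins.
W_a = rng.normal(scale=0.1, size=(action_dim, state_dim))
w_c = rng.normal(scale=0.1, size=state_dim + action_dim)
W_a_targ, w_c_targ = W_a.copy(), w_c.copy()

def actor(W, s):
    return W @ s

def critic(w, s, a):
    return w @ np.concatenate([s, a])

def ddpg_update(s, a, r, s2):
    """One DDPG step on a single transition (s, a, r, s2)."""
    global W_a, w_c, W_a_targ, w_c_targ
    # Critic target uses the *target* actor and critic: y = r + gamma * Q'(s2, mu'(s2)).
    y = r + gamma * critic(w_c_targ, s2, actor(W_a_targ, s2))
    # Gradient step on the squared TD error (Q(s, a) - y)^2.
    feat = np.concatenate([s, a])
    td = critic(w_c, s, a) - y
    w_c -= lr * td * feat
    # Deterministic policy gradient: ascend Q(s, mu(s)) via dQ/da * dmu/dW.
    dq_da = w_c[state_dim:]            # for a linear critic, dQ/da is just the action weights
    W_a += lr * np.outer(dq_da, s)
    # Polyak (soft) updates keep the targets slowly tracking the live networks.
    W_a_targ = tau * W_a + (1 - tau) * W_a_targ
    w_c_targ = tau * w_c + (1 - tau) * w_c_targ

s = rng.normal(size=state_dim)
a = actor(W_a, s) + 0.1 * rng.normal(size=action_dim)  # exploration noise on the action
ddpg_update(s, a, r=1.0, s2=rng.normal(size=state_dim))
```

In a full implementation the transition would come from a replay buffer and the update would be applied to minibatches, but the structure of each step (critic regression to a bootstrapped target, deterministic policy gradient for the actor, soft target updates) is the same.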