Classical control techniques such as PID and LQR have been used effectively to maintain a system state, but they become more difficult to implement as the system dynamics grow in complexity and sensitivity. For adaptive robotic locomotion tasks with several degrees of freedom, classical control becomes infeasible. Reinforcement learning, in contrast, can learn walking policies directly from interaction with the environment. We apply deep Q-learning and augmented random search (ARS) to teach a simulated two-dimensional bipedal robot to walk in the OpenAI Gym BipedalWalker-v3 environment. Deep Q-learning did not yield a high-reward policy, often converging prematurely to suboptimal local maxima, likely due to the coarsely discretized action space. ARS, by contrast, trained the robot far more successfully and produced a policy that officially "solves" the BipedalWalker-v3 problem. Several naive policies, including a random policy, a manually encoded inch-forward policy, and a stay-still policy, were used as benchmarks to evaluate the learned policies.
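To make the ARS approach concrete, the following is a minimal sketch of the basic ARS loop (a linear policy perturbed by random directions, as in Mania et al. 2018) applied to BipedalWalker-v3. The hyperparameters, rollout horizon, and the classic Gym step API (reset returns an observation; step returns a 4-tuple) are illustrative assumptions, not the configuration used in this work.

```python
# Minimal ARS sketch for BipedalWalker-v3 (assumes gym with Box2D installed).
import gym
import numpy as np

env = gym.make("BipedalWalker-v3")
obs_dim = env.observation_space.shape[0]   # 24-dimensional state
act_dim = env.action_space.shape[0]        # 4 continuous joint torques

def rollout(M, horizon=1600):
    """Run one episode with the linear policy a = M @ s; return total reward."""
    s, total = env.reset(), 0.0
    for _ in range(horizon):
        s, r, done, _ = env.step(np.clip(M @ s, -1.0, 1.0))
        total += r
        if done:
            break
    return total

alpha, nu, n_dirs, n_top = 0.02, 0.03, 16, 8   # step size, noise scale, directions
M = np.zeros((act_dim, obs_dim))               # linear policy weights

for step in range(1000):
    deltas = [np.random.randn(*M.shape) for _ in range(n_dirs)]
    r_plus = np.array([rollout(M + nu * d) for d in deltas])
    r_minus = np.array([rollout(M - nu * d) for d in deltas])
    # Keep only the directions that achieved the highest reward either way.
    order = np.argsort(np.maximum(r_plus, r_minus))[::-1][:n_top]
    sigma = np.concatenate([r_plus[order], r_minus[order]]).std() + 1e-8
    # Move the weights along the reward-weighted average of the top directions.
    M += alpha / (n_top * sigma) * sum(
        (r_plus[i] - r_minus[i]) * deltas[i] for i in order)
```

Because the policy is a single linear map, each update requires only finite-difference reward evaluations, which is what makes ARS competitive on this task despite its simplicity.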