Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition

We study the problem of realizing the full spectrum of bipedal locomotion on a real robot with sim-to-real reinforcement learning (RL). A key challenge of learning legged locomotion is describing different gaits, via reward functions, in a way that is intuitive for the designer and specific enough to reliably learn the gait across different initial random seeds or hyperparameters. A common approach is to use reference motions (e.g., trajectories of joint positions) to guide learning. However, finding high-quality reference motions can be difficult, and the trajectories themselves narrowly constrain the space of learned motion. At the other extreme, reference-free reward functions are often underspecified (e.g., "move forward"), leading to massive variance in policy behavior, or are the product of significant reward shaping via trial and error, making them exclusive to specific gaits. In this work, we propose a reward-specification framework based on composing simple probabilistic periodic costs on basic forces and velocities. We instantiate this framework to define a parametric reward function with intuitive settings for all common bipedal gaits: standing, walking, hopping, running, and skipping. Using this function, we demonstrate successful sim-to-real transfer of the learned gaits to the bipedal robot Cassie, as well as a generic policy that can transition between all of the two-beat gaits.
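To make the idea of a periodic cost on forces and velocities concrete, the following is a minimal sketch and not the reward function used in this work. It assumes a deterministic phase clock rather than a probabilistic one, and the names `in_swing`, `periodic_reward`, `swing_ratio`, and `offsets` are hypothetical. Each foot is penalized for ground force while its phase says it should be in the air, and for foot speed while its phase says it should be planted.

```python
def in_swing(phase, swing_ratio):
    """Return True while a foot is meant to be in the air.

    Assumption for this sketch: each cycle has length 1 and the swing
    portion comes first, occupying the fraction `swing_ratio`.
    """
    return (phase % 1.0) < swing_ratio


def periodic_reward(phase, foot_forces, foot_speeds,
                    swing_ratio=0.5, offsets=(0.0, 0.5)):
    """Sum simple periodic costs over the feet (hypothetical formulation).

    phase       -- position of the gait clock within the current cycle
    foot_forces -- ground reaction force magnitude per foot
    foot_speeds -- foot speed magnitude per foot
    offsets     -- per-foot phase shift; unequal offsets alternate the feet
    """
    reward = 0.0
    for force, speed, offset in zip(foot_forces, foot_speeds, offsets):
        if in_swing(phase + offset, swing_ratio):
            # Swing: the foot should not be pushing on the ground.
            reward -= abs(force)
        else:
            # Stance: the foot should not be moving.
            reward -= abs(speed)
    return reward


# Left foot in swing (no force), right foot in stance (not moving): no penalty.
ideal = periodic_reward(0.25, foot_forces=(0.0, 300.0), foot_speeds=(1.0, 0.0))
# The opposite contact pattern at the same phase is penalized.
wrong = periodic_reward(0.25, foot_forces=(300.0, 0.0), foot_speeds=(0.0, 1.0))
```

In this sketch, changing only `offsets` and `swing_ratio` changes which contact pattern is rewarded: offsets of `(0.0, 0.5)` alternate the feet, while `(0.0, 0.0)` move both feet together. This is meant only to illustrate how a small set of intuitive parameters could select among gaits.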
