We present a general two-stage reinforcement learning approach for creating robust policies that can be deployed on real robots without any additional training, using a single demonstration generated by trajectory optimization. The demonstration is used in the first stage as a starting point to facilitate initial exploration. In the second stage, the relevant task reward is optimized directly, and a policy robust to environment uncertainties is computed. We demonstrate and examine in detail the performance and robustness of our approach on highly dynamic hopping and bounding tasks on a quadruped robot.
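As a rough, hypothetical illustration of the two-stage reward structure described above (the paper's actual reward terms and weights are not specified here), a minimal Python sketch might combine a demonstration-tracking term with the task reward in the first stage, then optimize the task reward alone in the second; the Gaussian tracking form, `sigma`, and the `0.1` weight are all assumptions for illustration.

```python
import numpy as np

def imitation_reward(state, demo_state, sigma=0.5):
    # Hypothetical Gaussian reward for tracking the single demonstration
    # trajectory: highest when the state matches the demonstrated state.
    return np.exp(-np.sum((state - demo_state) ** 2) / (2.0 * sigma ** 2))

def stage_reward(stage, state, demo_state, task_reward):
    if stage == 1:
        # Stage 1: demonstration tracking bootstraps exploration,
        # with a small (assumed) weight on the task reward.
        return imitation_reward(state, demo_state) + 0.1 * task_reward
    # Stage 2: optimize the task reward directly; robustness would come
    # from training under randomized dynamics (not shown in this sketch).
    return task_reward
```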