Deep Reinforcement Learning has been successfully applied to learning robotic control. However, the corresponding algorithms struggle when the agent is rewarded only after completing a complex task. In this context, demonstrations can significantly speed up the learning process, but they can be costly to acquire. In this paper, we propose to leverage a sequential bias to learn control policies for complex robotic tasks from a single demonstration. To do so, our method learns a goal-conditioned policy that drives the system through a sequence of low-dimensional goals. This sequential goal-reaching approach raises a compatibility problem between successive goals: the state resulting from reaching one goal must be compatible with the achievement of the goals that follow. To tackle this problem, we present a new algorithm called DCIL-II. We show that DCIL-II solves challenging simulated tasks, such as humanoid locomotion and stand-up as well as fast running with a simulated Cassie robot, with unprecedented sample efficiency. By leveraging sequentiality, our method is a step towards solving complex robotic tasks with minimal specification effort, a key feature for the next generation of autonomous robots.
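To make the sequential goal-reaching idea concrete, here is a minimal sketch, assuming a Gym-style environment with a 4-tuple `step` API; the helpers `extract_goals`, `project_to_goal_space`, and the `policy`/`env` interfaces are illustrative assumptions, not the authors' DCIL-II implementation. The chaining makes the compatibility problem visible: each goal-reaching attempt starts from whatever state the previous one ended in.

```python
# Illustrative sketch of sequential goal-reaching with a goal-conditioned
# policy. All names are hypothetical; this is not DCIL-II's actual API.
import numpy as np

def extract_goals(demo_states, n_goals, project_to_goal_space):
    """Subsample a single demonstration into a chain of low-dimensional goals."""
    idx = np.linspace(0, len(demo_states) - 1, n_goals, dtype=int)
    return [project_to_goal_space(demo_states[i]) for i in idx]

def rollout(env, policy, goals, project_to_goal_space,
            success_threshold=0.1, max_steps_per_goal=200):
    """Chain goal-reaching attempts: the state reached at one goal is the
    starting state for the next, so successive goals must be compatible."""
    state = env.reset()
    for goal in goals:
        for _ in range(max_steps_per_goal):
            action = policy(state, goal)        # goal-conditioned policy
            state, _, _, _ = env.step(action)   # assumes Gym-style 4-tuple
            achieved = project_to_goal_space(state)
            if np.linalg.norm(achieved - goal) < success_threshold:
                break                           # goal reached, move to the next
        else:
            return False                        # goal missed: the chain breaks
    return True
```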