Reinforcement learning (RL) promises to enable autonomous acquisition of complex behaviors for diverse agents. However, the success of current reinforcement learning algorithms is predicated on an often under-emphasized requirement: each trial needs to start from a fixed initial state distribution. Unfortunately, resetting the environment to its initial state after each trial requires a substantial amount of human supervision and extensive instrumentation of the environment, which defeats the purpose of autonomous reinforcement learning. In this work, we propose Value-accelerated Persistent Reinforcement Learning (VaPRL), which generates a curriculum of initial states such that the agent can bootstrap on the success of easier tasks to efficiently learn harder tasks. The agent also learns to reach the initial states proposed by the curriculum, minimizing reliance on human intervention during learning. We observe that VaPRL reduces the interventions required by three orders of magnitude compared to episodic RL, while outperforming prior state-of-the-art methods for reset-free RL in terms of both sample efficiency and asymptotic performance on a variety of simulated robotics problems.
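As a rough illustration of the curriculum idea described above (a sketch, not the paper's exact procedure), the snippet below selects the next proposed initial state from previously visited states using a learned value function: states the current policy can already succeed from are kept, and among those the one closest to the task's true initial state is chosen, so the curriculum gradually expands back toward the real task. The function name `propose_initial_state`, the `value_threshold` parameter, and the Euclidean distance heuristic are all illustrative assumptions.

```python
import numpy as np

def propose_initial_state(candidate_states, value_fn, target_initial_state,
                          value_threshold=0.8):
    """Hypothetical curriculum step over initial states.

    candidate_states:      array of shape (N, state_dim), e.g. states drawn from the replay buffer
    value_fn:              callable mapping a batch of states to estimated values under the current policy
    target_initial_state:  the state the episodic task would normally be reset to
    value_threshold:       assumed cutoff above which a state counts as "already solvable"
    """
    values = value_fn(candidate_states)                       # estimated value of each candidate, shape (N,)
    solvable = candidate_states[values >= value_threshold]    # states the policy can already succeed from
    if len(solvable) == 0:
        # Nothing is solvable yet; fall back to the easiest candidate (highest value).
        return candidate_states[np.argmax(values)]
    # Among solvable states, pick the one nearest the true initial state,
    # pushing the curriculum toward the original (harder) task.
    dists = np.linalg.norm(solvable - target_initial_state, axis=1)
    return solvable[np.argmin(dists)]
```

In this sketch the agent would then be tasked with reaching the proposed state on its own before attempting the main task, which is what keeps the number of external resets low.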