Reinforcement learning (RL) promises to enable autonomous acquisition of complex behaviors for diverse agents. However, the success of current reinforcement learning algorithms is predicated on an often under-emphasized requirement: each trial needs to start from a fixed initial state distribution. Unfortunately, resetting the environment to its initial state after each trial requires a substantial amount of human supervision and extensive instrumentation of the environment, which defeats the goal of autonomous acquisition of complex behaviors. In this work, we propose Value-accelerated Persistent Reinforcement Learning (VaPRL), which generates a curriculum of initial states such that the agent can bootstrap on the success of easier tasks to efficiently learn harder tasks. The agent also learns to reach the initial states proposed by the curriculum, minimizing the reliance on human interventions during learning. We observe that VaPRL reduces the interventions required by three orders of magnitude compared to episodic RL, while outperforming prior state-of-the-art methods for reset-free RL in both sample efficiency and asymptotic performance on a variety of simulated robotics problems.
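To make the curriculum idea concrete, below is a minimal sketch of one plausible value-based selection rule consistent with the description above: among previously visited states, pick the one closest to the true start state whose estimated value toward the goal is already high enough, so the proposed initial states move backward toward the original task as the agent improves. The function names, the Euclidean distance, and the `value_threshold` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def propose_initial_state(visited_states, value_fn, goal, start_state,
                          value_threshold=0.8):
    """Pick a curriculum initial state from previously visited states.

    States whose estimated value toward the goal exceeds the threshold
    are treated as "easy enough"; among those, the state closest to the
    true start state is chosen, gradually tightening the curriculum.
    """
    values = np.array([value_fn(s, goal) for s in visited_states])
    easy_enough = values >= value_threshold
    if not easy_enough.any():
        # No state looks solvable yet: start from the highest-value state.
        return visited_states[int(values.argmax())]
    candidates = [s for s, ok in zip(visited_states, easy_enough) if ok]
    dists = [np.linalg.norm(np.asarray(s) - np.asarray(start_state))
             for s in candidates]
    return candidates[int(np.argmin(dists))]
```

In a reset-free setting, the agent would itself navigate to the proposed state rather than relying on an external reset, which is how the curriculum reduces the need for human interventions.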