The focus of this work is sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to an observed state which has high uncertainty, instead of sampling according to the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically reduce the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we can achieve super-human performance on the notoriously hard Atari game, Montezuma's Revenge, with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
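To make the data-collection procedure concrete, below is a minimal sketch of the UFLP loop described above. It assumes a simulator that can save and restore its internal state (here via hypothetical `get_state()`/`set_state()` methods) and a base agent that exposes an uncertainty estimate for observed states (e.g., ensemble disagreement or a count-based bonus); all class, method, and parameter names (`UncertaintyBuffer`, `agent.uncertainty`, `reset_prob`, `env.current_observation`) are illustrative placeholders, not the paper's API.

```python
import random
from typing import Any, List, Tuple


class UncertaintyBuffer:
    """Stores (simulator_state, uncertainty_score) pairs from past rollouts."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self.items: List[Tuple[Any, float]] = []

    def add(self, sim_state: Any, score: float) -> None:
        self.items.append((sim_state, score))
        if len(self.items) > self.capacity:
            # Drop the lowest-uncertainty entry to stay within capacity.
            self.items.remove(min(self.items, key=lambda x: x[1]))

    def sample_high_uncertainty(self) -> Any:
        # Return the stored simulator state with the highest uncertainty score.
        return max(self.items, key=lambda x: x[1])[0]


def collect_episode(env, agent, buffer: UncertaintyBuffer,
                    reset_prob: float = 0.5, max_steps: int = 1_000) -> None:
    """One UFLP data-collection iteration wrapped around a standard rollout."""
    if buffer.items and random.random() < reset_prob:
        # Local-planning branch: restore the simulator to a previously
        # observed high-uncertainty state instead of the initial distribution.
        env.set_state(buffer.sample_high_uncertainty())
        obs = env.current_observation()  # assumed accessor after a restore
    else:
        obs = env.reset()

    for _ in range(max_steps):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        agent.observe(obs, action, reward, next_obs, done)  # base RL update
        # Record the visited simulator state with the agent's current
        # uncertainty estimate so future iterations can reset to it.
        buffer.add(env.get_state(), agent.uncertainty(next_obs))
        obs = next_obs
        if done:
            break
```

After the optional reset, the interaction proceeds exactly as in standard online RL, so any base algorithm (e.g., a distributional double DQN) can be plugged in unchanged.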