Exploration under sparse reward is a long-standing challenge in model-free reinforcement learning. State-of-the-art methods address this challenge by introducing intrinsic rewards that encourage exploration of novel states or uncertain environment dynamics. Unfortunately, methods based on intrinsic rewards often fall short in procedurally-generated environments, where a different environment is generated in each episode, so the agent is unlikely to visit the same state more than once. Motivated by how humans judge good exploration behaviors by looking at the entire episode, we introduce RAPID, a simple yet effective episode-level exploration method for procedurally-generated environments. RAPID treats each episode as a whole and assigns it an episodic exploration score from both per-episode and long-term views. Highly scored episodes are treated as good exploration behaviors and are stored in a small ranking buffer. The agent then imitates the episodes in the buffer to reproduce the past good exploration behaviors. We demonstrate our method on several procedurally-generated MiniGrid environments, a first-person-view 3D maze navigation task from MiniWorld, and several sparse MuJoCo tasks. The results show that RAPID significantly outperforms state-of-the-art intrinsic reward strategies in terms of both sample efficiency and final performance. The code is available at https://github.com/daochenzha/rapid
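
To make the ranking-buffer idea concrete, below is a minimal Python sketch of the mechanism described above: score each finished episode, keep only the top-scoring episodes in a small buffer, and sample state-action pairs from the buffer for imitation. The specific scoring terms (per-episode state coverage plus a count-based long-term novelty bonus), the weights, and the buffer capacity are illustrative assumptions, not the paper's exact formulation; see the official repository at https://github.com/daochenzha/rapid for the authors' implementation.

```python
# Minimal sketch of an episode-ranking buffer (assumed details, not the official RAPID code).
import heapq
import random
from collections import Counter
from typing import Hashable, List, Tuple

Episode = List[Tuple[Hashable, int]]  # sequence of (observation, action) pairs


class RankingBuffer:
    """Keeps the top-K episodes ranked by an episodic exploration score."""

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self._heap: List[Tuple[float, int, Episode]] = []  # min-heap keyed by score
        self._counter = 0                 # tie-breaker so episodes are never compared directly
        self.visit_counts: Counter = Counter()  # long-term state visitation statistics

    def score(self, episode: Episode) -> float:
        """Assumed scoring: per-episode state coverage plus a count-based
        long-term novelty term (weights are placeholders)."""
        obs = [o for o, _ in episode]
        local = len(set(obs)) / max(len(obs), 1)  # per-episode view: distinct states visited
        global_ = sum(1.0 / (1 + self.visit_counts[o]) ** 0.5 for o in obs) / max(len(obs), 1)
        w_local, w_global = 1.0, 1.0
        return w_local * local + w_global * global_

    def add(self, episode: Episode) -> None:
        s = self.score(episode)
        self.visit_counts.update(o for o, _ in episode)  # update long-term view
        self._counter += 1
        heapq.heappush(self._heap, (s, self._counter, episode))
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)  # drop the lowest-scored episode

    def sample_pairs(self, batch_size: int) -> List[Tuple[Hashable, int]]:
        """Sample (observation, action) pairs from buffered episodes for imitation."""
        pairs = [p for _, _, ep in self._heap for p in ep]
        return random.sample(pairs, min(batch_size, len(pairs)))


if __name__ == "__main__":
    buf = RankingBuffer(capacity=4)
    # Toy episodes over a hypothetical grid world: observations are cell ids.
    for _ in range(20):
        episode = [(random.randrange(10), random.randrange(4)) for _ in range(8)]
        buf.add(episode)
    batch = buf.sample_pairs(batch_size=16)
    print(f"buffer holds {len(buf._heap)} episodes; sampled {len(batch)} pairs for imitation")
```

In this sketch the sampled pairs would feed a behavior-cloning loss on the policy, which is how the agent "imitates the episodes in the buffer" in the description above; the actual loss and how it is mixed with the reinforcement learning objective follow the paper, not this illustration.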