The promise of reinforcement learning is to solve complex sequential decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires thoroughly exploring the environment, but creating algorithms that can do so remains one of the central challenges of the field. We hypothesise that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states ("detachment") and from failing to first return to a state before exploring from it ("derailment"). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly remembering promising states and first returning to such states before intentionally exploring. Go-Explore solves all heretofore unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders of magnitude improvements on the grand challenges Montezuma's Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore's exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.
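The "remember, return, explore" loop described above can be sketched in a few lines. The following is a minimal, illustrative sketch only, on a hypothetical toy deterministic chain environment: all names (`ChainEnv`, `go_explore`, the uniform cell selection) are assumptions for illustration, not the paper's implementation, which uses emulator state restoration or a goal-conditioned return policy, domain-informed cell representations, and a separate robustification phase.

```python
import random

class ChainEnv:
    """Hypothetical toy environment: a deterministic 1-D chain.

    Actions move the agent one step left (-1) or right (+1), clamped
    to [0, length - 1]. Deterministic dynamics let us 'return' to a
    cell simply by replaying the actions that first reached it.
    """
    def __init__(self, length=20):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = max(0, min(self.length - 1, self.pos + action))
        return self.pos

def go_explore(env, iterations=2000, explore_steps=10, seed=0):
    """Sketch of Go-Explore's exploration phase on a deterministic env."""
    rng = random.Random(seed)
    # Archive: cell -> shortest action trajectory known to reach it.
    archive = {env.reset(): []}
    for _ in range(iterations):
        # "Go": select a promising cell (here, uniformly at random;
        # the paper weights selection toward under-visited cells).
        cell = rng.choice(list(archive))
        # Return to it by replaying its stored trajectory. This relies
        # on determinism; under stochasticity a goal-conditioned policy
        # would perform the return instead.
        state = env.reset()
        for a in archive[cell]:
            state = env.step(a)
        traj = list(archive[cell])
        # "Explore": take random actions from the restored state,
        # archiving new cells and shorter routes to known ones.
        for _ in range(explore_steps):
            a = rng.choice([-1, +1])
            state = env.step(a)
            traj.append(a)
            if state not in archive or len(traj) < len(archive[state]):
                archive[state] = list(traj)
    return archive
```

Because the archive explicitly stores how to reach each cell, the loop cannot "detach" from frontier states, and because exploration only begins after the return replay finishes, it cannot "derail" before arriving there.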