In recent years, a number of reinforcement learning (RL) methods have been proposed to explore complex environments which differ across episodes. In this work, we show that the effectiveness of these methods critically relies on a count-based episodic term in their exploration bonus. As a result, despite their success in relatively simple, noise-free settings, these methods fall short in more realistic scenarios where the state space is vast and prone to noise. To address this limitation, we introduce Exploration via Elliptical Episodic Bonuses (E3B), a new method which extends count-based episodic bonuses to continuous state spaces and encourages an agent to explore states that are diverse under a learned embedding within each episode. The embedding is learned using an inverse dynamics model in order to capture controllable aspects of the environment. Our method sets a new state-of-the-art across 16 challenging tasks from the MiniHack suite, without requiring task-specific inductive biases. E3B also matches existing methods on sparse reward, pixel-based VizDoom environments, and outperforms existing methods in reward-free exploration on Habitat, demonstrating that it can scale to high-dimensional pixel-based observations and realistic environments.
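To make the idea above concrete, here is a brief sketch of an elliptical episodic bonus; the notation (the embedding $\phi$, the matrix $C$, the regularizer $\lambda$, and the bonus $b$) is ours and is not taken verbatim from the abstract. Given the learned embedding $\phi$ of the current state and the embeddings of the states visited earlier in the same episode, the bonus at step $t$ can be written as

\[
b(s_t) \;=\; \phi(s_t)^\top C_{t-1}^{-1}\, \phi(s_t),
\qquad
C_{t-1} \;=\; \lambda I \;+\; \sum_{i=1}^{t-1} \phi(s_i)\,\phi(s_i)^\top ,
\]

where $\lambda > 0$ regularizes the matrix so that it is invertible early in the episode. When $\phi$ is a one-hot encoding of a discrete state, $C_{t-1}$ is diagonal and the bonus reduces to approximately the inverse within-episode visitation count, $1/(\lambda + N(s_t))$, which is the sense in which an elliptical bonus generalizes count-based episodic bonuses to continuous state spaces.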