State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $\epsilon$-greedy) for exploration, but this method fails on hard exploration tasks like Montezuma's Revenge. To address the challenge of exploration, prior works incentivize exploration by rewarding the agent when it visits novel states. Such intrinsic rewards (also called an exploration bonus or curiosity) often lead to excellent performance on hard exploration tasks. However, on easy exploration tasks, the agent gets distracted by intrinsic rewards and performs unnecessary exploration even when sufficient task (also called extrinsic) reward is available. Consequently, such an overly curious agent performs worse than an agent trained with only the task reward. This inconsistency in performance across tasks prevents the widespread use of intrinsic rewards with RL algorithms. We propose a principled constrained optimization procedure called Extrinsic-Intrinsic Policy Optimization (EIPO) that automatically tunes the importance of the intrinsic reward: it suppresses the intrinsic reward when exploration is unnecessary and increases it when exploration is required. The result is superior exploration that does not require manual tuning of the balance between the intrinsic reward and the task reward. Consistent performance gains across sixty-one ATARI games validate our claim. The code is available at https://github.com/Improbable-AI/eipo.
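To make the high-level idea concrete, below is a minimal, hypothetical sketch of adaptively scaling an intrinsic bonus with a Lagrangian-style multiplier. It is not the exact EIPO formulation: the names `combined_reward`, `update_alpha`, `alpha`, and `lr_alpha`, as well as the particular mixing and update rules, are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch: adapt the weight of an intrinsic (exploration) bonus
# so that it is suppressed when task reward alone suffices and restored when
# extra exploration is needed. All names and update rules are illustrative,
# not EIPO's actual algorithm.

def combined_reward(r_ext: float, r_int: float, alpha: float) -> float:
    """Mix task (extrinsic) and exploration (intrinsic) rewards.

    A larger alpha down-weights the intrinsic bonus relative to the task reward.
    """
    return r_ext + r_int / (1.0 + alpha)


def update_alpha(alpha: float,
                 ext_return_mixed: float,
                 ext_return_task_only: float,
                 lr_alpha: float = 0.01) -> float:
    """Increase alpha (suppress exploration) when the policy trained on the
    mixed objective falls behind a task-reward-only baseline on extrinsic
    return; decrease it (encourage exploration) otherwise. Alpha is kept
    non-negative.
    """
    constraint_violation = ext_return_task_only - ext_return_mixed
    return max(0.0, alpha + lr_alpha * constraint_violation)
```

In this sketch, when the mixed-objective policy's extrinsic return lags behind the task-only baseline, alpha grows and the intrinsic bonus is dampened; when exploration pays off in extrinsic return, alpha shrinks toward zero and the bonus regains influence.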