A major challenge in reinforcement learning is exploration: local dithering methods such as epsilon-greedy sampling are often insufficient to solve a given task. Many recent methods intrinsically motivate the agent to seek novel states, on the premise that novelty leads to improved reward. However, while state-novelty exploration methods are suitable for tasks where novel observations correlate well with improved reward, they may explore no more efficiently than epsilon-greedy approaches in environments where the two are poorly correlated. In this paper, we distinguish between exploration tasks in which seeking novel states aids in finding new reward, and those where it does not, such as goal-conditioned tasks and escaping local reward maxima. We propose a new exploration objective: maximizing the reward prediction error (RPE) of a value function trained to predict extrinsic reward. We then propose QXplore, a deep reinforcement learning method that exploits the temporal-difference error of a Q-function to solve hard exploration tasks in high-dimensional MDPs. We demonstrate the exploration behavior of QXplore on several OpenAI Gym MuJoCo tasks and Atari games, and observe that QXplore is comparable to or better than a baseline state-novelty method in all cases, outperforming the baseline on tasks where state novelty is poorly correlated with improved reward.
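To make the proposed objective concrete, the sketch below shows a minimal tabular simplification, assuming the core idea described above: the agent's exploration bonus is the magnitude of the reward prediction error, i.e. the temporal-difference error of a Q-function trained on extrinsic reward. The tabular setting, the function names, and the hyperparameters are illustrative assumptions, not the paper's deep RL implementation.

```python
import numpy as np

def td_error(q, state, action, reward, next_state, gamma=0.99):
    """Temporal-difference error of a tabular Q-function under the
    extrinsic reward (hypothetical tabular stand-in for a deep Q-network)."""
    target = reward + gamma * np.max(q[next_state])
    return target - q[state, action]

def intrinsic_reward(q, state, action, reward, next_state, gamma=0.99):
    """Exploration bonus in the spirit of the RPE objective: the
    absolute reward prediction error of the Q-function."""
    return abs(td_error(q, state, action, reward, next_state, gamma))

# Illustrative usage: an untrained Q-table mispredicts a rewarding
# transition, so the transition receives a large exploration bonus.
q = np.zeros((3, 2))                        # 3 states, 2 actions
bonus = intrinsic_reward(q, 0, 1, 1.0, 2)   # surprising extrinsic reward
```

An exploration policy trained to maximize this bonus is driven toward transitions where the value function's predictions are wrong, rather than toward merely novel observations.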