In this paper we consider reinforcement learning tasks with progressive rewards; that is, tasks where the rewards tend to increase in magnitude over time. We hypothesise that this property may be problematic for value-based deep reinforcement learning agents, particularly if the agent must first succeed in relatively unrewarding regions of the task in order to reach more rewarding regions. To address this issue, we propose Spectral DQN, which decomposes the reward into frequencies such that the high frequencies only activate when large rewards are found. This allows the training loss to be balanced so that it gives more even weighting across small- and large-reward regions. In two domains with extreme reward progressivity, where standard value-based methods struggle significantly, Spectral DQN is able to make far greater progress. Moreover, when evaluated on a set of six standard Atari games that do not overtly favour the approach, Spectral DQN remains more than competitive: while it underperforms one of the benchmarks in a single game, it comfortably surpasses the benchmarks in three games. These results demonstrate that the approach is not overfit to its target problem, and suggest that Spectral DQN may have advantages beyond addressing reward progressivity.
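To make the idea of a spectral reward decomposition concrete, the sketch below shows one simple way such a split could be realised: a base-β "fill-up" scheme in which lower-frequency components saturate at 1 before higher ones activate, so only large rewards excite the high frequencies and each component's loss can be weighted evenly. This is an illustrative sketch under our own assumptions; the function names and the parameters `beta` and `n_freqs` are hypothetical and not necessarily the paper's exact formulation.

```python
import numpy as np

def spectral_decompose(r, n_freqs=8, beta=2.0):
    """Split a non-negative reward r into n_freqs components c_i in [0, 1]
    such that r == sum_i c_i * beta**i (for r within the representable range,
    i.e. r <= sum_i beta**i).  Lower frequencies fill up first, so higher
    frequencies only become non-zero for large rewards."""
    comps = np.zeros(n_freqs)
    remaining = float(r)
    for i in range(n_freqs):
        comps[i] = np.clip(remaining / beta**i, 0.0, 1.0)
        remaining -= comps[i] * beta**i
    return comps

def spectral_reconstruct(comps, beta=2.0):
    """Recover the scalar reward: r = sum_i c_i * beta**i."""
    return sum(c * beta**i for i, c in enumerate(comps))

if __name__ == "__main__":
    # Small rewards touch only the low frequencies; large rewards
    # progressively activate the higher ones.
    for r in [0.3, 1.0, 7.5, 130.0]:
        c = spectral_decompose(r)
        print(r, c, spectral_reconstruct(c))
```

Because every component is bounded in [0, 1], a separate value head could be trained per frequency with losses on a comparable scale, which is one way the overall training signal could be kept from being dominated by the largest rewards.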