In this work we continue to build upon recent advances in reinforcement learning for finite Markov processes. A common approach among previous algorithms, both single-actor and distributed, is either to clip rewards or to apply a transformation to the Q-functions in order to handle the wide range of magnitudes of real discounted returns. We show theoretically that one of the most successful methods of this kind may fail to yield an optimal policy when the process is non-deterministic. As a remedy, we argue that distributional reinforcement learning resolves this situation completely. By introducing a conjugated distributional operator, we can handle a large class of transformations of real returns with guaranteed theoretical convergence. We propose an approximating single-actor algorithm based on this operator that trains agents directly on unaltered rewards, using a proper distributional metric given by the Cram\'er distance. To evaluate its performance in a stochastic setting, we train agents on a suite of 55 Atari 2600 games with sticky actions and obtain state-of-the-art performance compared to other well-known algorithms in the Dopamine framework.
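For concreteness, the Cram\'er distance used as the distributional metric is, for real-valued random variables $X$ and $Y$ with cumulative distribution functions $F_X$ and $F_Y$,
\[
\ell_2(X, Y) \;=\; \left( \int_{\mathbb{R}} \bigl( F_X(z) - F_Y(z) \bigr)^2 \, dz \right)^{1/2}.
\]
As an illustrative sketch only, assuming an invertible transformation $h$ of the returns and a distributional Bellman operator $\mathcal{T}$, a conjugated operator can be written in the form $\mathcal{T}_h = h \circ \mathcal{T} \circ h^{-1}$; the precise construction and its convergence guarantees are developed in the body of the paper.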