Efficiently propagating credit to the responsible actions is a central and challenging task in reinforcement learning. To accelerate this propagation, this paper presents a new method that builds a highway along which information flows unimpeded across long horizons. The key to our method is a newly proposed Bellman equation, the Greedy-Step Bellman Optimality Equation, through which high-credit information can propagate quickly across a long horizon. We show theoretically that the solution of the new equation is exactly the optimal value function and that the corresponding operator converges faster than the classical operator. The equation also leads to a new multi-step off-policy algorithm that can safely use off-policy data collected by an arbitrary policy. Experiments show that the proposed method is reliable and easy to implement. Moreover, without employing any additional Rainbow components other than Double DQN, our method performs competitively with Rainbow on the benchmark tasks.
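The abstract does not state the equation itself, so the following is only a rough sketch of how a greedy multi-step target might be computed along one stored trajectory. It assumes that the target is the maximum, over bootstrap horizons n, of the discounted n-step return plus the discounted greedy value at the n-th next state. The function name `greedy_step_target` and its arguments are hypothetical and do not come from the paper.

```python
def greedy_step_target(rewards, next_max_q, gamma=0.99):
    """Sketch of a greedy multi-step target for the first transition of a trajectory.

    Assumed form (not confirmed by the abstract):
        max over n of  sum_{t<n} gamma^t * r_t  +  gamma^n * max_a Q(s_n, a)

    rewards[t]     : reward observed at step t of the trajectory
    next_max_q[t]  : max_a Q(s_{t+1}, a), with 0.0 used after a terminal step
    """
    if len(rewards) != len(next_max_q):
        raise ValueError("rewards and next_max_q must have the same length")
    if not rewards:
        raise ValueError("trajectory must contain at least one transition")

    best = float("-inf")
    n_step_return = 0.0  # discounted sum of rewards seen so far
    discount = 1.0       # gamma^t for the current step t

    for reward, bootstrap in zip(rewards, next_max_q):
        n_step_return += discount * reward
        discount *= gamma  # now gamma^n for horizon n = t + 1
        candidate = n_step_return + discount * bootstrap
        best = max(best, candidate)  # keep the most optimistic horizon

    return best


# Example: a sparse reward that arrives only at the third step.
target = greedy_step_target([0.0, 0.0, 1.0], [0.1, 0.2, 0.0], gamma=0.5)
```

In this example the one-step target is only 0.05, while the three-step candidate reaches the reward directly, which illustrates the kind of long-horizon shortcut the abstract calls a highway. With a single transition the sketch reduces to the ordinary one-step Q-learning target.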