Multi-step off-policy reinforcement learning has achieved great success. However, existing multi-step methods usually impose a fixed prior on the bootstrap step, while off-policy methods often require additional correction, which can introduce undesired effects. In this paper, we propose a novel bootstrapping method that greedily takes the maximum among the bootstrapping values computed with varying steps. The new method has two desirable properties: 1) it can flexibly adjust the bootstrap step based on the quality of the data and the learned value function; 2) it can safely and robustly utilize data from an arbitrary behavior policy without additional correction, regardless of its quality or degree of "off-policyness". We analyze the theoretical properties of the related operator, showing that it converges to the globally optimal value function at a rate faster than the traditional Bellman Optimality Operator. Furthermore, based on this new operator, we derive new model-free RL algorithms named Greedy Multi-Step Q Learning (and Greedy Multi-step DQN). Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance on a series of standard benchmarks.
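To make the central idea concrete, the following is a minimal sketch of the greedy multi-step bootstrap target suggested by the abstract, assuming the target is the maximum over n-step returns computed along a sampled trajectory segment; the function name `greedy_multistep_target` and its arguments are illustrative, not from the paper.

```python
# Hedged sketch (not the authors' implementation): greedy choice of the
# bootstrap step by taking the maximum over n-step bootstrap values.
import numpy as np


def greedy_multistep_target(rewards, next_state_values, gamma=0.99):
    """Return max_n [ sum_{t<n} gamma^t r_t + gamma^n * max_a Q(s_n, a) ].

    rewards           : r_0, ..., r_{N-1} along a sampled trajectory segment.
    next_state_values : bootstrap estimates max_a Q(s_1, a), ..., max_a Q(s_N, a)
                        at each possible cut-off step.
    """
    best_target = -np.inf
    discounted_return = 0.0
    for n in range(len(rewards)):
        discounted_return += (gamma ** n) * rewards[n]
        # (n+1)-step bootstrap value: truncated return plus discounted tail estimate.
        candidate = discounted_return + (gamma ** (n + 1)) * next_state_values[n]
        best_target = max(best_target, candidate)
    return best_target


# Usage: a 3-step segment with a bootstrap estimate at each cut-off.
rewards = [1.0, 0.0, 2.0]
next_state_values = [5.0, 4.5, 3.0]
print(greedy_multistep_target(rewards, next_state_values))
```

Because the maximum is taken over candidate targets rather than over importance-weighted corrections, such a target can, in principle, consume off-policy data of any quality without an explicit correction term, which is the property the abstract emphasizes.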