Compared to on-policy policy gradient techniques, off-policy model-free deep reinforcement learning (RL) that uses previously gathered data can improve sampling efficiency. However, off-policy learning becomes challenging when the discrepancy between the distributions of the policy of interest and the policies that collected the data increases. Although the well-studied importance sampling and off-policy policy gradient techniques have been proposed to compensate for this discrepancy, they usually require long trajectories, which increases the computational complexity and induces additional problems such as vanishing/exploding gradients or the discarding of many useful experiences. Moreover, their generalization to continuous action domains is strictly limited, as they require action probabilities and are therefore unsuitable for deterministic policies. To overcome these limitations, we introduce a novel policy similarity measure that mitigates the effects of such discrepancy. Our method offers an adequate single-step off-policy correction without any probability estimates, and theoretical results show that it yields a contraction mapping with a unique fixed point, which allows "safe" off-policy learning. An extensive set of empirical results indicates that our algorithm substantially improves on the state-of-the-art and attains higher returns in fewer steps than the competing methods by efficiently scheduling the learning rate in Q-learning and policy optimization.
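To make the discrepancy issue concrete, the following is a generic sketch of the standard importance sampling correction (in illustrative notation, not the proposed method): the off-policy update is reweighted by the likelihood ratio between the policy of interest $\pi$ and the behavior policy $\mu$,
$$
\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}, \qquad \rho_{t:t+n} = \prod_{k=t}^{t+n} \rho_k .
$$
Multi-step corrections require the product $\rho_{t:t+n}$ over long trajectories, whose variance grows with the trajectory length and whose clipped or truncated variants discard experiences; moreover, for a deterministic policy, $\pi(a \mid s)$ collapses to a Dirac delta and the ratio is ill-defined. The similarity measure introduced here replaces such probability ratios with a probability-free, single-step correction.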