The policy gradient theorem (Sutton et al., 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem break this assumption in practice, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient in this form can be expressed in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation of gradients. By applying temporal-difference updates to the gradient critic on an off-policy data stream, we develop the first estimator that sidesteps the distribution-shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
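To make the idea of a gradient critic estimated by temporal-difference updates concrete, here is a minimal tabular sketch. It is an illustration under assumptions, not the paper's algorithm: the MDP (`P`, `R`), the softmax policy parameters `theta`, the uniform behavior policy, the step size, and the importance weight `rho` are all illustrative choices, and the action values `Q` are assumed known (computed exactly here), whereas a practical method would estimate them too. The vector-valued critic `G[s]` is driven toward a fixed point of a Bellman-style recursion for gradients, G(s) = Σ_a π(a|s)[Q(s,a)∇log π(a|s) + γ Σ_s' P(s'|s,a) G(s')], using single off-policy transitions, and the policy gradient is then read off at the start state.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, tabular softmax target policy.
# All names here (P, R, theta, G, behavior) are illustrative, not from the paper.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal((n_states, n_actions))                    # reward r(s, a)
theta = 0.5 * rng.standard_normal((n_states, n_actions))          # softmax parameters

def pi(theta):
    # Target policy pi(a|s) as a per-state softmax over theta[s].
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_log_pi(theta, s, a):
    # d/d theta of log pi(a|s) for the tabular softmax policy.
    g = np.zeros_like(theta)
    g[s] -= pi(theta)[s]
    g[s, a] += 1.0
    return g

def q_values(theta):
    # Q^pi solved exactly from the Bellman equation (assumed known in this sketch;
    # in practice Q would itself be learned, e.g. by temporal-difference methods).
    p = pi(theta)
    P_pi = np.einsum("sax,xb->saxb", P, p).reshape(n_states * n_actions, -1)
    r = R.reshape(-1)
    q = np.linalg.solve(np.eye(n_states * n_actions) - gamma * P_pi, r)
    return q.reshape(n_states, n_actions)

# Gradient critic G[s]: a vector of the same shape as theta for each state,
# updated by temporal differences from a stream of off-policy transitions.
Q = q_values(theta)
G = np.zeros((n_states,) + theta.shape)
behavior = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform behavior policy
alpha, s = 0.05, 0
for _ in range(50_000):
    a = rng.choice(n_actions, p=behavior[s])
    s_next = rng.choice(n_states, p=P[s, a])
    rho = pi(theta)[s, a] / behavior[s, a]  # importance weight, corrects the action choice
    # TD target from the (assumed) Bellman equation of gradients.
    target = rho * (Q[s, a] * grad_log_pi(theta, s, a) + gamma * G[s_next])
    G[s] += alpha * (target - G[s])
    s = s_next

# Policy gradient reconstructed from the start state s0 = 0: grad J(theta) ~ G[0].
print(G[0])
```

Note the design point this sketch mirrors: the transitions are generated by the uniform behavior policy, not the target policy, yet no correction of the *state* distribution is attempted; only the sampled action is reweighted, and the recursion itself carries the gradient back to the start state.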