Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on the policy gradient estimation error that is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of the FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal with a matching Cramér-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
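To make the off-policy setting concrete, the sketch below illustrates one simple plug-in construction: a softmax tabular target policy, a fixed dataset collected by a uniform behavior policy, a Q-function fit by iterated least squares on Bellman targets with one-hot (linear) features, and a gradient assembled via the policy gradient theorem. This is an illustrative sketch of the problem setup only, not the paper's double fitted FPG procedure; the sizes, discount factor, synthetic data, and the use of the empirical dataset state distribution in place of the discounted visitation distribution are all assumptions made here for illustration.

```python
# Minimal sketch of off-policy PG estimation with a softmax tabular policy and
# one-hot linear value features. Illustrative only; not the paper's FPG method.
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.9   # assumed toy problem sizes
rng = np.random.default_rng(0)

# Softmax tabular target policy pi_theta(a | s).
theta = rng.normal(size=(n_states, n_actions))
def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

# Off-policy dataset (s, a, r, s') from a uniform behavior policy;
# rewards and transitions are placeholders for a real environment.
n_samples = 2000
S = rng.integers(n_states, size=n_samples)
A = rng.integers(n_actions, size=n_samples)
R = rng.normal(size=n_samples)
S_next = rng.integers(n_states, size=n_samples)

# Fitted Q evaluation: iterate least-squares regression onto Bellman targets
# Q(s,a) <- r + gamma * E_{a' ~ pi}[Q(s', a')]. With one-hot features this
# reduces to averaging targets per (s, a) cell.
pi = policy(theta)
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    targets = R + gamma * (pi[S_next] * Q[S_next]).sum(axis=1)
    sums = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    np.add.at(sums, (S, A), targets)
    np.add.at(counts, (S, A), 1.0)
    Q = np.where(counts > 0, sums / np.maximum(counts, 1.0), Q)

# Plug-in gradient via the policy gradient theorem, averaging over dataset
# states as a stand-in for the discounted state visitation distribution:
# grad J ~ (1/n) * sum_i sum_a grad_theta pi(a | s_i) * Q(s_i, a).
grad = np.zeros_like(theta)
one_hot = np.eye(n_actions)
for s in S:
    for a in range(n_actions):
        # Softmax gradient: d pi(a|s) / d theta[s, :] = pi(a|s) * (e_a - pi(s, :))
        grad[s] += Q[s, a] * pi[s, a] * (one_hot[a] - pi[s])
grad /= n_samples
print("estimated policy gradient:\n", grad)
```

Because the synthetic rewards here are independent of states and actions, the true gradient is near zero and the printed estimate is essentially noise; plugging in a real dataset is what makes the estimate meaningful.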