Offline policy evaluation (OPE) is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on estimating the value of a target policy from pre-collected data generated by a possibly different (behavior) policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is semiparametrically efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points per trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the Bellman operator in the off-policy setting, which characterizes the difficulty of OPE and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
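For readers unfamiliar with marginal importance sampling, the following is a minimal sketch of the generic weighted value estimator that motivates the approach; the notation ($\widehat{\omega}$, $d^{\pi}$, $d^{b}$) is illustrative and not necessarily the paper's, and the paper's contribution lies in how the weights are constructed via approximately projected state-action balancing rather than in this generic form.

\[
\widehat{V}(\pi) \;=\; \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1} \widehat{\omega}(S_{i,t},A_{i,t})\, R_{i,t},
\qquad
\omega(s,a) \;=\; \frac{d^{\pi}(s,a)}{d^{b}(s,a)},
\]

where $n$ is the number of trajectories, $T$ is the number of decision points per trajectory, $R_{i,t}$ is the observed reward, and $\omega$ is the ratio of the (discounted or stationary) state-action visitation distributions under the target policy $\pi$ and the behavior policy. Estimating $\omega$ accurately, and quantifying the resulting estimator's efficiency as both $n$ and $T$ grow, is the central statistical task addressed in the paper.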