Offline policy evaluation (OPE) is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on estimating the value of a target policy from pre-collected data generated by a possibly different behavior policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for policy value estimation. We derive the convergence rate of these weights and show that the proposed value estimator is semiparametrically efficient under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points in each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we make a first attempt toward characterizing the difficulty of OPE problems, which may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
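For reference, a schematic form of the marginal importance sampling identity that motivates the balancing-weight construction is sketched below; the symbols $d^{\pi}$, $\bar{d}$, $\gamma$, $N$, $T$, and $\widehat{\omega}$ are introduced here for illustration only and follow one common convention rather than the paper's exact notation.
\[
\mathcal{V}(\pi) \;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{(S,A)\sim \bar{d}}\!\left[\omega^{\pi}(S,A)\,R\right],
\qquad
\omega^{\pi}(s,a) \;=\; \frac{d^{\pi}(s,a)}{\bar{d}(s,a)},
\]
with the sample counterpart
\[
\widehat{\mathcal{V}}(\pi) \;=\; \frac{1}{(1-\gamma)\,N T}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\widehat{\omega}(S_{i,t},A_{i,t})\,R_{i,t},
\]
where $d^{\pi}$ is the discounted state-action visitation distribution under the target policy $\pi$, $\bar{d}$ is the state-action distribution of the offline data, and $N$ and $T$ denote the number of trajectories and decision points per trajectory. In this sketch, the proposed approach can be read as estimating $\widehat{\omega}$ through (approximately projected) state-action balancing rather than direct density-ratio estimation.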