Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE under each of these structural assumptions. This precisely characterizes the curse of horizon: in time-variant processes, OPE is feasible only in the near-on-policy setting, where behavior and target policies are sufficiently similar. But in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with just one dependent trajectory, and they provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions: it remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either one is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
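To make concrete how the two nuisance estimates could combine, here is a schematic doubly robust form of this kind of estimator (our own illustration, not necessarily the paper's exact construction): suppose the target is the normalized discounted value $\rho^{\pi^e} = (1-\gamma)\,\mathbb{E}_{\pi^e}\big[\sum_{t \ge 0} \gamma^t r_t\big]$, $\hat{w}(s,a)$ estimates the stationary (discounted-occupancy) density ratio between target and behavior policies, $\hat{q}$ estimates the $q$-function, and $\hat{v}(s) = \mathbb{E}_{a \sim \pi^e(\cdot \mid s)}[\hat{q}(s,a)]$. Then, given $n$ trajectories of length $T$,
\[
\hat{\rho}_{\mathrm{DRL}}
= \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=0}^{T-1}
\hat{w}(s_{it}, a_{it}) \big( r_{it} + \gamma\, \hat{v}(s_{i,t+1}) - \hat{q}(s_{it}, a_{it}) \big)
\; + \; (1-\gamma)\, \frac{1}{n} \sum_{i=1}^{n} \hat{v}(s_{i0}).
\]
The first term is a density-ratio-weighted temporal-difference correction to the plug-in value estimate in the second term; the estimator remains consistent if either $\hat{w}$ or $\hat{q}$ is consistent, and errors in the two nuisances enter only through their product, which is what allows slow nonparametric rates for each while retaining efficiency.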