We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
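To make the reweighting idea concrete, the following is a minimal sketch of the kind of variance-weighted regression step the abstract alludes to; it is not the paper's exact estimator. The symbols $\phi$ (feature map), $\widehat{\sigma}_h^2$ (estimated conditional variance of the value function), $\lambda$ (ridge parameter), and $K$ (number of offline episodes) are notational assumptions not defined in this excerpt.
\[
\widehat{\theta}_h \;=\; \operatorname*{arg\,min}_{\theta} \;\sum_{k=1}^{K} \frac{\bigl(\phi(s_h^k, a_h^k)^\top \theta \;-\; r_h^k \;-\; \widehat{V}_{h+1}(s_{h+1}^k)\bigr)^2}{\widehat{\sigma}_h^2(s_h^k, a_h^k)} \;+\; \lambda \,\lVert \theta \rVert_2^2 .
\]
Under this reading, Bellman residuals at state-action pairs with low estimated variance receive larger weight in the Fitted Q-Iteration update, which is the mechanism through which variance information improves sample efficiency.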