We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and $q$-functions when these are estimated using recent minimax methods. Under various combinations of realizability and completeness assumptions, we show that the minimax approach enables us to achieve a fast rate of convergence for weights and quality functions, characterized by the critical inequality \citep{bartlett2005}. Based on this result, we analyze convergence rates for OPE. In particular, we introduce novel alternative completeness conditions under which OPE is feasible and we present the first finite-sample result with first-order efficiency in non-tabular environments, i.e., having the minimal coefficient in the leading term.
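For concreteness, the minimax estimators referred to above can be written schematically as follows, where $\mathcal{W}$ and $\mathcal{Q}$ denote the weight and $q$-function classes, $d_b$ the offline (behavior) data distribution, $d_0$ the initial-state distribution, $\gamma$ the discount factor, and $\pi$ the target policy; this display is an illustrative sketch of the standard minimax formulation under this assumed notation, not a restatement of the exact estimators analyzed in the paper:
\begin{align*}
\hat{w} &\in \operatorname*{arg\,min}_{w \in \mathcal{W}} \; \max_{q \in \mathcal{Q}} \;
\Big( \mathbb{E}_{d_b}\big[ w(s,a)\,\{\gamma\, q(s',\pi) - q(s,a)\}\big]
      + (1-\gamma)\, \mathbb{E}_{s_0 \sim d_0}\big[ q(s_0,\pi) \big] \Big)^2, \\
\hat{q} &\in \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \max_{w \in \mathcal{W}} \;
\Big( \mathbb{E}_{d_b}\big[ w(s,a)\,\{ r + \gamma\, q(s',\pi) - q(s,a)\}\big] \Big)^2,
\end{align*}
where $q(s,\pi) := \mathbb{E}_{a \sim \pi(\cdot \mid s)}[q(s,a)]$ and, in practice, the population expectations over $d_b$ are replaced by empirical averages over the offline data.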