We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast $Q$-function estimation as a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that, under one mild condition, the NPIV formulation of $Q$-function estimation is well-posed in the sense of the $L^2$-measure of ill-posedness with respect to the data generating distribution. This bypasses a strong assumption on the discount factor $\gamma$ that the recent literature imposes to obtain $L^2$ convergence rates for various $Q$-function estimators. Thanks to this new well-posedness property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of the $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on well-posedness and the minimax lower bounds are of independent interest for studying not only other nonparametric estimators of the $Q$-function but also efficient estimation of the value of any target policy in off-policy settings.
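As a sketch of how the recasting can be read, the Bellman equation for the target policy can be written as a conditional moment restriction. The notation here is assumed for illustration and is not fixed by the abstract: $(S, A, R, S')$ denotes a state, action, reward, and next state drawn from the data generating distribution, $\pi$ the target policy, and $Q^\pi$ its $Q$-function.

$$
Q^\pi(s,a) = \mathbb{E}\left[ R + \gamma \int Q^\pi(S', a')\, \pi(da' \mid S') \;\middle|\; S = s,\ A = a \right],
$$

which is equivalent to

$$
\mathbb{E}\left[ R - \left( Q^\pi(S,A) - \gamma \int Q^\pi(S', a')\, \pi(da' \mid S') \right) \;\middle|\; S,\ A \right] = 0 .
$$

Under this reading, $(S, A)$ plays the role of the instruments, and the unknown function enters through $Q^\pi(S,A) - \gamma \int Q^\pi(S', a')\, \pi(da' \mid S')$, which depends on the next state $S'$ as well as on $(S, A)$.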