In a Markov decision process (MDP), unobservable confounders may influence the data-generating process, so classic off-policy evaluation (OPE) estimators may fail to identify the true value function of the target policy. In this paper, we study the statistical properties of OPE in confounded MDPs with observable instrumental variables. Specifically, we propose a two-stage estimator based on the instrumental variables and establish its statistical properties in confounded MDPs with a linear structure. For the non-asymptotic analysis, we prove an $\mathcal{O}(n^{-1/2})$ error bound, where $n$ is the number of samples. For the asymptotic analysis, we prove that the two-stage estimator is asymptotically normal at the typical $n^{1/2}$ rate. To the best of our knowledge, we are the first to establish such statistical results for the two-stage estimator in confounded linear MDPs via instrumental variables.
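To illustrate the two-stage idea, the following is a minimal sketch of generic two-stage least squares (2SLS) on a toy static linear model. It is not the estimator proposed in the paper, which targets the value function of a confounded linear MDP. The function names `two_stage_least_squares` and `simulate`, the data-generating model, and the coefficient values are all assumptions made purely for illustration.

```python
import numpy as np


def two_stage_least_squares(Z, X, y):
    """Generic 2SLS (illustrative, not the paper's estimator).

    Z: (n, k) instruments, X: (n, d) regressors, y: (n,) outcomes.
    """
    # Stage 1: regress the regressors on the instruments and keep the fitted values.
    first_stage, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ first_stage
    # Stage 2: regress the outcome on the fitted (instrument-driven) regressors.
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta


def simulate(n, seed=0):
    """Hypothetical toy model: u confounds both x and y, z is a valid instrument."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 1))            # observed instrument
    u = rng.normal(size=(n, 1))            # unobserved confounder
    x = z + u + 0.1 * rng.normal(size=(n, 1))
    y = 2.0 * x[:, 0] + 3.0 * u[:, 0] + 0.1 * rng.normal(size=n)
    return z, x, y


TRUE_BETA = 2.0  # assumed true effect of x on y in the toy model
z, x, y = simulate(5000)

# Naive least squares ignores the confounder and is biased.
beta_ols, *_ = np.linalg.lstsq(x, y, rcond=None)
# 2SLS uses only the variation in x that is driven by the instrument.
beta_iv = two_stage_least_squares(z, x, y)

print("OLS estimate: ", beta_ols[0])
print("2SLS estimate:", beta_iv[0])
```

In this toy setting the naive regression absorbs the confounder's effect into its coefficient, while the instrumented estimate should land close to `TRUE_BETA`. Under standard assumptions, the error of such an estimator shrinks at the $n^{-1/2}$ rate as the sample size grows.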