A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes

We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.
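To make the minimax estimation idea concrete, the sketch below solves a generic conditional moment restriction E[g(X) - Y | Z] = 0 with linear function classes for both the learner g and the adversarial critic. It is an illustration under stated assumptions, not the bridge-function equations or estimators of this work: the toy data-generating process, the feature maps, and the names `simulate` and `minimax_linear` are all hypothetical. With linear classes and a quadratic penalty on the critic, the inner maximization has a closed form, so the whole problem reduces to a few linear solves.

```python
import numpy as np


def simulate(n, seed=0):
    """Toy confounded data (hypothetical): u is latent, z is an observed proxy-like variable."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                       # unobserved confounder
    z = rng.normal(size=n)                       # observed variable independent of u
    x = z + u + 0.1 * rng.normal(size=n)         # regressor affected by the confounder
    y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # outcome; true slope on x is 2
    return x, z, y


def minimax_linear(phi, psi, y, ridge=1e-8):
    """Solve min_theta max_omega E_n[(phi@theta - y) * (psi@omega)] - lam * E_n[(psi@omega)**2].

    For linear classes the inner maximum equals (A theta - b)^T S^{-1} (A theta - b) / (4 lam),
    so the penalty weight lam only rescales the outer objective and does not move its minimizer.
    """
    n = len(y)
    A = psi.T @ phi / n                          # empirical E[psi phi^T]
    b = psi.T @ y / n                            # empirical E[psi y]
    S = psi.T @ psi / n + ridge * np.eye(psi.shape[1])  # empirical E[psi psi^T], lightly regularized
    S_inv_A = np.linalg.solve(S, A)
    S_inv_b = np.linalg.solve(S, b)
    # Minimizer of the quadratic outer objective in theta.
    return np.linalg.solve(A.T @ S_inv_A, A.T @ S_inv_b)


x, z, y = simulate(20000)
phi = np.column_stack([np.ones_like(x), x])      # learner features: intercept and x
psi = np.column_stack([np.ones_like(z), z])      # critic features: intercept and z

theta_minimax = minimax_linear(phi, psi, y)
theta_ols = np.linalg.lstsq(phi, y, rcond=None)[0]  # naive regression that ignores confounding

print("minimax slope:", theta_minimax[1])
print("naive OLS slope:", theta_ols[1])
```

In this toy setup the naive regression slope is biased upward by the latent confounder (its population value is roughly 3.5), while the minimax solution should land close to the true slope of 2. Replacing the linear features with richer function classes would require an iterative solver rather than this closed form.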
