We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. As such, these methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. In fully-observable MDPs, these bridge functions reduce to the familiar value functions and marginal density ratios between the evaluation and the behavior policies. We next propose minimax estimation methods for learning these bridge functions. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. Finally, we construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Their nonasymptotic and asymptotic properties are investigated in detail.
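As a concrete point of reference, the display below sketches the fully-observable MDP special case mentioned above, in which the bridge functions reduce to the state-action value function and the marginal density ratio, and the three estimator templates take their standard forms. This is a minimal illustrative sketch, not the paper's POMDP construction: the discounted infinite-horizon setting, the initial distribution $\nu_0$, the discount factor $\gamma$, the estimates $\widehat{Q}$ and $\widehat{w}$, and the empirical average $\mathbb{E}_n$ over observed tuples $(S, A, R, S')$ are all assumed notation.

% Illustrative sketch (assumed notation): value function-based, marginalized
% importance sampling, and doubly-robust estimators in the fully-observable
% MDP case, where \widehat{Q} estimates the evaluation policy's value function
% and \widehat{w} estimates the marginal density ratio between the evaluation
% policy's discounted occupancy and the behavior data distribution.
\begin{align*}
  \widehat{v}_{\mathrm{VF}}  &= \mathbb{E}_{S_0 \sim \nu_0,\, A_0 \sim \pi_e(\cdot \mid S_0)}\big[\widehat{Q}(S_0, A_0)\big], \\
  \widehat{v}_{\mathrm{MIS}} &= \frac{1}{1-\gamma}\,\mathbb{E}_n\big[\widehat{w}(S, A)\, R\big], \\
  \widehat{v}_{\mathrm{DR}}  &= \widehat{v}_{\mathrm{VF}}
      + \frac{1}{1-\gamma}\,\mathbb{E}_n\Big[\widehat{w}(S, A)\,\big(R
      + \gamma \textstyle\sum_{a'} \pi_e(a' \mid S')\,\widehat{Q}(S', a')
      - \widehat{Q}(S, A)\big)\Big].
\end{align*}

Under this reduction, the doubly-robust form combines the first two and remains consistent if either $\widehat{Q}$ or $\widehat{w}$ is correctly specified, which is the standard sense of double robustness that the third estimator's name refers to.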