We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it can consistently estimate the stationary mean rewards of a target policy given long enough draws from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap of the target and behavior policies, and on the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.
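The abstract does not define the estimator itself; as a minimal sketch of what the name suggests (not the paper's exact construction), partial history importance weighting can be read as reweighting each reward by the policy likelihood ratio accumulated over only the most recent k steps of the trajectory. In the hypothetical notation below, O_t, A_t, and R_t denote the observation, action, and reward at time t, \pi_e and \pi_b the target and behavior policies acting on observations, T the trajectory length, and k the history-truncation length; none of these symbols appear in the abstract and they are introduced here only for illustration:
\[
\hat{\theta}_k \;=\; \frac{1}{T-k} \sum_{t=k+1}^{T} \left( \prod_{s=t-k+1}^{t} \frac{\pi_e(A_s \mid O_s)}{\pi_b(A_s \mid O_s)} \right) R_t .
\]
Under this reading, the truncation length k would trade off a bias term that shrinks as the underlying system mixes faster against a variance term that grows with the inverse overlap of the two policies, which is consistent with the dependence on mixing and overlap described in the abstract.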