We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables while the behavior policy depends on latent states (Tennenholtz et al., 2020a). Prior work on this problem uses a causal identification strategy based on one-step observable proxies of the hidden state, which relies on the invertibility of certain one-step moment matrices. In this work, we relax this requirement by using spectral methods and by extending the one-step proxies into both the past and the future. We empirically compare our OPE methods to existing ones and demonstrate their improved prediction accuracy and greater generality. Lastly, we derive a separate Importance Sampling (IS) algorithm that relies on rank, distinctness, and positivity conditions, rather than on the strict sufficiency conditions of observable trajectories with respect to the reward and hidden-state structure required by Tennenholtz et al. (2020a).
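To make the relaxation concrete, the following is a minimal, hypothetical sketch (in NumPy, not the paper's code) of the spectral ingredient: instead of inverting a one-step moment matrix exactly, one forms a cross-moment matrix between multi-step past and future proxy features and applies a rank-truncated pseudo-inverse, which remains well defined even when the matrix is not invertible. The feature construction, the rank parameter `r`, and the toy data below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Minimal sketch, not the paper's implementation: it only illustrates the
# spectral idea of replacing the exact inverse of a one-step moment matrix
# with a rank-truncated pseudo-inverse built from multi-step (past/future)
# proxy features.  The feature arrays, rank `r`, and sizes are hypothetical.

def truncated_pinv(M, r, tol=1e-12):
    """Rank-r Moore-Penrose pseudo-inverse of M via SVD; well defined even
    when M is singular, so no invertibility assumption on M is needed."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = (np.arange(s.size) < r) & (s > tol)          # top-r nonzero singular values
    s_inv = np.where(keep, 1.0 / np.where(s > tol, s, 1.0), 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T

rng = np.random.default_rng(0)
n = 1_000
future_feats = rng.normal(size=(n, 6))  # features of multi-step future observations
past_feats = rng.normal(size=(n, 6))    # features of multi-step past observations

# Empirical cross-moment matrix between futures and pasts; it may be
# rank-deficient, which is exactly the case the spectral approach tolerates.
P_fp = future_feats.T @ past_feats / n
P_fp_pinv = truncated_pinv(P_fp, r=4)
```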