We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play a role similar to that of classical value functions in fully observable MDPs. We derive a new Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We establish a PAC result, which implies that our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states and a Bellman completeness condition holds. Finally, we extend our method to learning dynamics and establish the connection between our approach and the well-known spectral learning methods for POMDPs.
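As an illustrative sketch in our own notation (the symbols below are ours, not necessarily the paper's), let $g$ denote a future-dependent value function over a future proxy $F$, let $H$ be a history proxy, let $\mu(O, A) = \pi_e(A \mid O) / \pi_b(A \mid O)$ be the observable importance weight, and let $(R, F')$ denote the reward and next future proxy. The conditional moment form of the new Bellman equation and a corresponding minimax learning objective can then be written roughly as
\[
\mathbb{E}\!\left[\, \mu(O, A)\bigl(R + \gamma\, g(F')\bigr) - g(F) \;\middle|\; H \,\right] = 0,
\]
\[
\hat g \in \arg\min_{g \in \mathcal{G}} \; \max_{\xi \in \Xi} \;
\mathbb{E}_n\!\left[\, \xi(H)\bigl\{ \mu(O, A)\bigl(R + \gamma\, g(F')\bigr) - g(F) \bigr\} \,\right]
- \lambda\, \mathbb{E}_n\!\left[\, \xi(H)^2 \,\right],
\]
where $\mathcal{G}$ and $\Xi$ are function classes for the value function and the test functions, and $\lambda \ge 0$ is a stabilizing regularizer; the history proxy $H$ plays the role of an instrumental variable in the moment restriction.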