Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that takes into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e., the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.
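To make the observability assumption concrete, here is one standard way to formalize "well-separated distributions on states lead to well-separated distributions on observations" as a one-step condition; the parameter name $\gamma$ and the choice of the $\ell_1$ norm are our reading of the abstract, not a quotation of the paper's definition. Writing $\mathbb{O} \in \mathbb{R}^{O \times S}$ for the observation matrix, with $\mathbb{O}_{o,s} = \Pr[\text{observe } o \mid \text{state } s]$, the POMDP is called $\gamma$-observable if
\[
\|\mathbb{O} b - \mathbb{O} b'\|_1 \;\geq\; \gamma \,\|b - b'\|_1 \qquad \text{for all belief distributions } b, b' \in \Delta(S).
\]
Note that this is a condition on the observation channel alone; it places no restriction on the transition dynamics, which is exactly the point emphasized above.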
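For readers unfamiliar with filter stability, the following is a minimal sketch of the object being analyzed, in our notation rather than the paper's. Given action $a_t$ and observation $o_{t+1}$, the Bayes filter updates a belief $b_t \in \Delta(S)$ via
\[
b_{t+1}(s') \;\propto\; \Pr[o_{t+1} \mid s'] \sum_{s \in S} \Pr[s' \mid s, a_t] \, b_t(s).
\]
Filter stability asks how quickly two filters run on the same action/observation sequence but started from different priors $b_0, b_0'$ merge, i.e., how fast $\mathbb{E}\,\|b_t - b_t'\|_1$ decays with $t$. The quantitative bounds referred to in the abstract show that under $\gamma$-observability this distance decays exponentially in $t$, at a rate depending polynomially on $\gamma$; the precise rate is stated in the paper and not reproduced here. At a high level, this forgetting of the initialization is what allows near-optimal policies to depend only on a short window of recent history, and hence to admit quasi-succinct descriptions.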