Partially observable Markov decision processes (POMDPs) are standard models for dynamic systems with probabilistic and nondeterministic behaviour in uncertain environments. We prove that in POMDPs with the long-run average objective, the decision maker has approximately optimal strategies with finite memory. This implies, notably, that the problem of approximating the long-run value is recursively enumerable, and that the value satisfies a weak continuity property with respect to the transition function.
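To make "finite memory" concrete, the sketch below is an illustration under stated assumptions. It is not the construction from the paper. It models a finite-memory strategy as a deterministic finite-state controller: `action_of[m]` picks the action in memory node `m`, and `update[(m, o)]` moves to a new node after observation `o`. Combined with the POMDP, such a controller induces a finite Markov chain on (state, memory) pairs. The toy POMDP (`STATES`, `ACTIONS`, `OBSERVATIONS`, `transition`, `observe`, `reward`) is hypothetical. It contains two hidden states that alternate deterministically and gives the agent no informative observations. The long-run average is approximated by a finite-horizon Cesàro average rather than computed exactly. Enumerating controllers with more and more memory nodes hints at why finite-memory near-optimality supports recursive enumerability: each controller yields a computable lower bound on the value.

```python
from itertools import product

# Hypothetical toy POMDP (not from the paper): the hidden state flips every
# step, and every state emits the same observation, so the agent sees nothing.
STATES = [0, 1]
ACTIONS = [0, 1]
OBSERVATIONS = [0]


def transition(s, a):
    """Distribution over next states: the hidden state always flips."""
    return {1 - s: 1.0}


def observe(s):
    """Every state emits the same (uninformative) observation."""
    return 0


def reward(s, a):
    """Reward 1 for an action that matches the current hidden state, else 0."""
    return 1.0 if a == s else 0.0


def average_reward(action_of, update, s0, m0, horizon):
    """Finite-horizon Cesaro average (1/N) * sum_{t<N} E[r_t].

    The controller and the POMDP form a finite Markov chain on
    (state, memory) pairs. Its distribution is propagated exactly, and the
    finite horizon N stands in for the long-run limit.
    """
    dist = {(s0, m0): 1.0}
    total = 0.0
    for _ in range(horizon):
        nxt = {}
        for (s, m), p in dist.items():
            a = action_of[m]
            total += p * reward(s, a)
            for s2, q in transition(s, a).items():
                m2 = update[(m, observe(s2))]
                nxt[(s2, m2)] = nxt.get((s2, m2), 0.0) + p * q
        dist = nxt
    return total / horizon


def best_with_memory(k, horizon=1000):
    """Best average reward over all deterministic k-node controllers."""
    nodes = range(k)
    keys = [(m, o) for m in nodes for o in OBSERVATIONS]
    best = 0.0
    for acts in product(ACTIONS, repeat=k):
        for targets in product(nodes, repeat=len(keys)):
            value = average_reward(
                dict(enumerate(acts)), dict(zip(keys, targets)), 0, 0, horizon
            )
            best = max(best, value)
    return best
```

In this toy example, `best_with_memory(1)` returns `0.5`, because a memoryless controller repeats one action and matches the hidden state only every other step. `best_with_memory(2)` returns `1.0`, because a two-node controller can alternate its actions in step with the hidden state. Randomized controllers and exact limit computations, for example via the stationary behaviour of the finite chain, are omitted here for brevity.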