We consider the problem of finding the best memoryless stochastic policy for an infinite-horizon partially observable Markov decision process (POMDP) with finite state and action spaces, under either the discounted or the mean reward criterion. We show that the (discounted) state-action frequencies and the expected cumulative reward are rational functions of the policy, where the degree is determined by the degree of partial observability. We then describe the optimization problem as a linear optimization problem in the space of feasible state-action frequencies, subject to polynomial constraints that we characterize explicitly. This allows us to address the combinatorial and geometric complexity of the optimization problem using recent tools from polynomial optimization. In particular, we demonstrate how the partial observability constraints can lead to multiple smooth and non-smooth local optimizers, and we estimate the number of critical points.
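As a minimal sketch of the quantities referenced above (the notation $\gamma$, $\eta$, $\beta$, $\tau$ is assumed here for illustration and is not fixed by the abstract), the discounted state-action frequencies and the expected discounted reward can be written as
\[
  \eta^{\pi}(s,a) \;=\; \sum_{t=0}^{\infty} \gamma^{t}\,
  \mathbb{P}^{\pi}\!\left(s_t = s,\, a_t = a\right),
  \qquad
  R_{\gamma}(\pi) \;=\; \mathbb{E}^{\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t,a_t)\Big]
  \;=\; \sum_{s,a} r(s,a)\,\eta^{\pi}(s,a),
\]
so the reward is linear in $\eta^{\pi}$. For a memoryless stochastic policy the effective state policy factors through the observation kernel, $\pi(a\mid s) = \sum_{o} \beta(o\mid s)\,\tau(a\mid o)$, and $\eta^{\pi}$ solves a linear system whose coefficients depend linearly on $\tau$; by Cramer's rule this makes $\eta^{\pi}$, and hence $R_{\gamma}$, rational functions of the policy parameters $\tau$.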