We consider the problem of finding the best memoryless stochastic policy for an infinite-horizon partially observable Markov decision process (POMDP) with finite state and action spaces with respect to either the discounted or mean reward criterion. We show that the (discounted) state-action frequencies and the expected cumulative reward are rational functions of the policy, where the degree is determined by the degree of partial observability. We then describe the optimization problem as a linear optimization problem in the space of feasible state-action frequencies subject to polynomial constraints that we characterize explicitly. This allows us to address the combinatorial and geometric complexity of the optimization problem using recent tools from polynomial optimization. In particular, we estimate the number of critical points and use the polynomial programming description of reward maximization to solve a navigation problem in a grid world.
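As a schematic illustration of the formulation described above (the symbols \(\eta\), \(r\), \(\mathcal{N}\), and \(\mathcal{V}\) are illustrative placeholders and not necessarily the paper's notation), reward maximization over memoryless stochastic policies can be viewed as a linear objective over state-action frequencies subject to polynomial constraints:
\[
\max_{\eta}\ \sum_{s \in S,\; a \in A} r(s,a)\,\eta(s,a)
\qquad \text{subject to} \qquad
\eta \in \mathcal{N} \cap \mathcal{V},
\]
where \(\mathcal{N}\) denotes the polytope of feasible (discounted) state-action frequencies of the underlying MDP, cut out by linear flow and normalization constraints, and \(\mathcal{V}\) is a set cut out by polynomial equations expressing that the policy may depend on the state only through its observations.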