Policy Iteration (PI) is a widely used family of algorithms for computing optimal policies for Markov Decision Problems (MDPs). We derive upper bounds on the running time of PI on Deterministic MDPs (DMDPs): the class of MDPs in which every state-action pair has a unique next state. Our results include a non-trivial upper bound that applies to the entire family of PI algorithms, and an affirmation that a conjecture regarding Howard's PI on MDPs holds for DMDPs. Our analysis is based on certain graph-theoretic results, which may be of independent interest.
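To make the setting concrete, below is a minimal Python sketch of Howard's PI (the variant that switches every improvable state) on a small discounted DMDP. The 3-state instance, the rewards, the discount factor, and the discounted-reward criterion are illustrative assumptions for exposition only, not the paper's exact setting; other members of the PI family differ in which subset of improvable states they switch at each iteration.

```python
import numpy as np

# Hypothetical DMDP instance (not from the paper): 3 states, 2 actions.
# next_state[s, a] is the unique successor of taking action a in state s,
# which is exactly the defining property of a DMDP.
next_state = np.array([[1, 2], [2, 0], [0, 1]])
reward     = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
gamma = 0.9                          # assumed discount factor
n_states, n_actions = next_state.shape

def evaluate(policy):
    # Exact policy evaluation: solve V = r_pi + gamma * P_pi V.
    # P_pi is a 0/1 matrix because every state-action pair has a
    # unique next state.
    P = np.zeros((n_states, n_states))
    r = np.empty(n_states)
    for s in range(n_states):
        P[s, next_state[s, policy[s]]] = 1.0
        r[s] = reward[s, policy[s]]
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)

policy = np.zeros(n_states, dtype=int)
while True:
    V = evaluate(policy)
    Q = reward + gamma * V[next_state]   # action values under V
    greedy = Q.argmax(axis=1)
    idx = np.arange(n_states)
    # Howard's PI: switch every state with a strictly improving action
    # (ties keep the current action, which guarantees termination).
    improved = np.where(Q[idx, greedy] > Q[idx, policy] + 1e-12,
                        greedy, policy)
    if np.array_equal(improved, policy):
        break
    policy = improved

print("optimal policy:", policy)
print("values:", V)
```

Since each iteration strictly improves the value of at least one state, the loop terminates after finitely many iterations; bounding how many iterations can occur, as a function of the number of states and actions, is the kind of running-time question the paper addresses.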