In this paper, we provide a novel algorithm for solving planning and learning problems in Markov decision processes. The proposed algorithm follows a policy iteration-type update that uses a rank-one approximation of the transition probability matrix in the policy evaluation step. This rank-one approximation is closely related to the stationary distribution of the corresponding transition probability matrix, which is approximated using the power method. We provide theoretical guarantees that the proposed algorithm converges to the optimal (action-)value function with the same rate and computational complexity as the value iteration algorithm in the planning problem and as the Q-learning algorithm in the learning problem. Despite these matching guarantees, our extensive numerical simulations show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions on both planning and learning problems.
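To make the main ingredients concrete, the following is a minimal planning sketch, assuming a tabular MDP given as a transition tensor P and reward matrix r: the stationary distribution of the greedy policy's transition matrix is estimated with the power method, and policy evaluation is carried out under the rank-one surrogate P_pi ≈ 1·mu^T, which admits a closed-form solve. The function names, iteration counts, and the exact update below are illustrative assumptions, not the paper's pseudocode.

```python
import numpy as np

def stationary_distribution(P, num_iters=50):
    """Approximate the stationary distribution of a row-stochastic matrix P
    with the power method (iteration count is illustrative)."""
    n = P.shape[0]
    mu = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        mu = mu @ P
        mu /= mu.sum()  # renormalize for numerical stability
    return mu

def rank_one_planning(P, r, gamma=0.95, num_iters=100):
    """Sketch of a policy-iteration-type method whose evaluation step replaces
    the policy transition matrix P_pi by the rank-one surrogate 1 * mu^T.

    P : (S, A, S) transition tensor, r : (S, A) reward matrix.
    This is an illustrative assumption about the update, not the exact algorithm."""
    S, A, _ = P.shape
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(num_iters):
        # Policy improvement: act greedily with respect to the current value estimate.
        Q = r + gamma * (P @ V)          # shape (S, A)
        pi = Q.argmax(axis=1)
        # Quantities induced by the greedy policy.
        P_pi = P[np.arange(S), pi]       # shape (S, S)
        r_pi = r[np.arange(S), pi]       # shape (S,)
        # Rank-one policy evaluation: with P_pi ≈ 1 * mu^T, Sherman-Morrison gives
        # V = (I - gamma * 1 mu^T)^{-1} r_pi = r_pi + gamma/(1-gamma) * (mu @ r_pi) * 1.
        mu = stationary_distribution(P_pi)
        V = r_pi + gamma / (1.0 - gamma) * (mu @ r_pi) * np.ones(S)
    return V, pi
```

The closed-form evaluation step is what distinguishes this sketch from standard policy iteration: solving the linear system (I - gamma * P_pi) V = r_pi is replaced by a single inner product with the power-method estimate of mu, so each iteration costs no more than a value iteration sweep.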