Generalized Second Order Value Iteration in Markov Decision Processes

Value iteration is a fixed-point iteration technique for computing the optimal value function and policy of a discounted-reward Markov Decision Process (MDP). It constructs a contraction operator and applies it repeatedly until the iterates reach the optimal solution. Because value iteration is a first-order method, it may need a large number of iterations to converge. Successive relaxation is a popular technique for solving fixed-point equations, and it has been shown in the literature that, under a special structure of the MDP, the successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second-order value iteration procedure obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove that our algorithm converges globally and asymptotically to the optimal solution, and we show that its convergence is of second order. Through experiments, we demonstrate the effectiveness of the proposed approach.
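To make the first-order baseline concrete, the following is a minimal sketch of standard and successively relaxed value iteration. It is not the second-order algorithm proposed in this work. It rests on assumptions that the abstract does not state: the relaxed update is taken to be V <- w T(V) + (1 - w) V, the "special structure" is taken to mean that every state has a positive self-transition probability under every action, and the largest admissible relaxation parameter is taken to be 1 / (1 - gamma * min p(s|s,a)). The toy MDP and all function names are hypothetical.

```python
def bellman_backup(V, P, R, gamma):
    """Apply the Bellman optimality operator T to the value vector V.

    P[s][a][t] is the probability of moving from state s to state t
    under action a; R[s][a] is the expected one-step reward.
    """
    n = len(V)
    return [
        max(
            R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(R[s]))
        )
        for s in range(n)
    ]


def relaxed_value_iteration(P, R, gamma, w=1.0, tol=1e-10, max_iter=100_000):
    """Iterate V <- w * T(V) + (1 - w) * V until successive iterates agree.

    w = 1 recovers standard value iteration; w > 1 is over-relaxation
    (assumed form of the update, see the lead-in).
    Returns the final value vector and the number of iterations used.
    """
    V = [0.0] * len(P)
    for k in range(1, max_iter + 1):
        TV = bellman_backup(V, P, R, gamma)
        V_new = [w * tv + (1.0 - w) * v for tv, v in zip(TV, V)]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new, k
        V = V_new
    return V, max_iter


def max_relaxation(P, gamma):
    """Assumed upper bound on w: 1 / (1 - gamma * min_{s,a} P[s][a][s])."""
    min_self = min(P[s][a][s] for s in range(len(P)) for a in range(len(P[s])))
    return 1.0 / (1.0 - gamma * min_self)


# Hypothetical 2-state, 2-action MDP in which every self-transition is positive.
P = [
    [[0.7, 0.3], [0.5, 0.5]],  # transitions from state 0 under actions 0 and 1
    [[0.4, 0.6], [0.2, 0.8]],  # transitions from state 1 under actions 0 and 1
]
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9

w_star = max_relaxation(P, gamma)
V_std, n_std = relaxed_value_iteration(P, R, gamma, w=1.0)
V_sor, n_sor = relaxed_value_iteration(P, R, gamma, w=w_star)
print(f"w* = {w_star:.3f}; standard VI: {n_std} iterations; relaxed VI: {n_sor} iterations")
```

Both runs stop at the same fixed point, because V = w T(V) + (1 - w) V holds exactly when V = T(V). Under the stated assumptions, the relaxed iteration contracts with modulus 1 - w + w * gamma, which is smaller than gamma when w > 1. Both variants still converge only linearly, which is the limitation a second-order scheme aims to overcome.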
