Robust Markov decision processes (RMDPs) are promising models that provide reliable policies under ambiguity in the model parameters. In contrast to nominal Markov decision processes (MDPs), however, state-of-the-art solution methods for RMDPs are limited to value-based methods, such as value iteration and policy iteration. This paper proposes Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs with a global convergence guarantee in tabular problems. Unlike value-based methods, DRPG does not rely on dynamic programming techniques; in particular, the inner-loop robust policy evaluation problem is solved via projected gradient descent. Finally, our experimental results demonstrate the performance of our algorithm and verify our theoretical guarantees.
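To make the double-loop structure concrete, the following is a minimal, self-contained sketch in Python/NumPy. It assumes a tabular RMDP with an (s,a)-rectangular L2-ball ambiguity set around a nominal transition kernel, a directly parameterized policy, a uniform initial state distribution, and a heuristic projection onto the intersection of the simplex and the L2 ball; these are illustrative assumptions for exposition, not the exact formulation or algorithmic details of DRPG.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def policy_value(pi, P, R, gamma):
    """Exact tabular value of policy pi under transition kernel P[s, a, s']."""
    S = R.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)      # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)        # expected reward per state
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def occupancy(pi, P, gamma):
    """Unnormalized discounted state occupancy for a uniform initial distribution."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.ones(S) / S)

def robust_evaluation(pi, P_nom, R, gamma, radius, steps=200, lr=0.1):
    """Inner loop: projected gradient descent over the kernel P to approximate
    the worst-case value of pi within an L2 ball around the nominal kernel."""
    S, A, _ = P_nom.shape
    P = P_nom.copy()
    for _ in range(steps):
        v = policy_value(pi, P, R, gamma)
        d = occupancy(pi, P, gamma)
        # Analytic gradient of the (uniformly weighted) return w.r.t. P[s, a, s'].
        grad = gamma * np.einsum('s,sa,p->sap', d, pi, v)
        P = P - lr * grad                      # adversary descends (minimizes the value)
        for s in range(S):
            for a in range(A):
                # Heuristic projection onto {simplex} intersected with the L2 ball.
                p = project_simplex(P[s, a])
                diff = p - P_nom[s, a]
                if np.linalg.norm(diff) > radius:
                    p = project_simplex(P_nom[s, a] + diff * radius / np.linalg.norm(diff))
                P[s, a] = p
    return P, policy_value(pi, P, R, gamma)

def drpg_sketch(P_nom, R, gamma=0.9, radius=0.1, outer_steps=100, lr=0.5):
    """Outer loop: projected policy gradient ascent on a directly parameterized
    policy, evaluated under the worst-case kernel returned by the inner loop."""
    S, A, _ = P_nom.shape
    pi = np.full((S, A), 1.0 / A)              # start from the uniform policy
    for _ in range(outer_steps):
        P_worst, v = robust_evaluation(pi, P_nom, R, gamma, radius)
        q = R + gamma * np.einsum('sap,p->sa', P_worst, v)
        d = occupancy(pi, P_worst, gamma)
        pi = pi + lr * d[:, None] * q          # ascend on the robust objective
        pi = np.apply_along_axis(project_simplex, 1, pi)
    return pi
```

Running the sketch only requires a hypothetical nominal kernel `P_nom` of shape (S, A, S) with rows on the simplex and a reward table `R` of shape (S, A); the outer step holds the worst-case kernel fixed while updating the policy, which is the Danskin-style pattern a double-loop robust policy gradient method suggests.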