Modified policy iteration (MPI), also known as optimistic policy iteration, is at the core of many reinforcement learning algorithms. It works by combining elements of policy iteration and value iteration. The convergence of MPI has been well studied for discounted and average-cost MDPs. In this work, we consider the exponential-cost risk-sensitive MDP formulation, which is known to provide some robustness to model parameters. Although policy iteration and value iteration have been well studied in the context of risk-sensitive MDPs, modified policy iteration is relatively unexplored. We provide the first proof that MPI also converges for the risk-sensitive problem in the case of finite state and action spaces. Since the exponential-cost formulation involves a multiplicative Bellman equation, our main contribution is a convergence proof that differs substantially from existing results for discounted and risk-neutral average-cost problems. A proof of approximate modified policy iteration for risk-sensitive MDPs is also provided in the appendix.
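To make the interpolation between policy iteration and value iteration concrete, the following is a minimal sketch of standard MPI on a discounted MDP (not the paper's risk-sensitive, multiplicative-Bellman variant). The toy two-state, two-action MDP, the parameter `m`, and all variable names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical toy MDP: P[a, s, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
n_states, gamma, m = 2, 0.9, 5  # m = number of partial-evaluation backups

V = np.zeros(n_states)
for _ in range(200):
    # Greedy improvement step, as in policy iteration.
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    pi = Q.argmax(axis=1)
    # Partial evaluation: only m Bellman backups under pi (m=1 recovers
    # value iteration; m -> infinity recovers exact policy iteration).
    for _ in range(m):
        Q_pi = R + gamma * np.einsum('ast,t->sa', P, V)
        V = Q_pi[np.arange(n_states), pi]
```

With the additive backups above replaced by the multiplicative Bellman operator of the exponential-cost formulation, the same evaluate-partially-then-improve loop is the object whose convergence the paper analyzes.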