Optimal policies in standard MDPs can be obtained using either value iteration or policy iteration. However, in the case of zero-sum Markov games, there is no efficient policy iteration algorithm; for example, it has been shown in Hansen et al. (2013) that one has to solve Ω(1/(1−α)) MDPs, where α is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is provably convergent only under very restrictive assumptions. Prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges, and does so exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead in the policy improvement phase. This is appealing because lookahead is already used in practical learning algorithms for games. We further show that lookahead can be implemented efficiently in linear Markov games, which are the counterpart of the much-studied linear MDPs. We illustrate the application of our new policy iteration algorithm by providing sample and time complexity bounds for policy-based reinforcement learning (RL) algorithms.
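To make the "lookahead in the policy improvement phase" idea concrete, the sketch below gives one possible reading of the scheme on a small, finite zero-sum Markov game: the current joint policy pair is evaluated exactly, the Shapley (game Bellman) operator is applied H times to the resulting values, and both players then improve greedily by solving a matrix game at each state. This is a minimal illustrative sketch, not the paper's reference implementation; the array layout (r[s, a, b], P[s, a, b, s']), the lookahead depth H, and all function names are assumptions made for this example.

```python
# A minimal sketch, under the assumptions stated above, of naive policy
# iteration with lookahead for a finite discounted zero-sum Markov game.
import numpy as np
from scipy.optimize import linprog


def solve_matrix_game(G):
    """Value and maximin mixed strategy of the row player in the zero-sum
    matrix game G[a, b] (row player maximizes), via the standard LP."""
    n_a, n_b = G.shape
    # Variables (v, x_1, ..., x_{n_a}): maximize v subject to
    # x^T G[:, b] >= v for every column b, with x a probability vector.
    c = np.concatenate(([-1.0], np.zeros(n_a)))           # linprog minimizes
    A_ub = np.hstack([np.ones((n_b, 1)), -G.T])           # v - x^T G[:, b] <= 0
    b_ub = np.zeros(n_b)
    A_eq = np.concatenate(([0.0], np.ones(n_a))).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, None)] * n_a
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]


def shapley_operator(V, r, P, alpha):
    """One application of the Shapley (game Bellman) operator to V."""
    TV = np.empty(r.shape[0])
    for s in range(r.shape[0]):
        Q = r[s] + alpha * P[s] @ V                       # Q[a, b]
        TV[s], _ = solve_matrix_game(Q)
    return TV


def evaluate_joint_policy(pi1, pi2, r, P, alpha):
    """Exact value of the (mixed) joint policy pair via a linear solve."""
    r_pi = np.einsum('sa,sab,sb->s', pi1, r, pi2)         # per-state reward
    P_pi = np.einsum('sa,sabt,sb->st', pi1, P, pi2)       # induced Markov chain
    return np.linalg.solve(np.eye(r.shape[0]) - alpha * P_pi, r_pi)


def naive_pi_with_lookahead(r, P, alpha, H=3, iters=50):
    """Naive policy iteration where the greedy improvement step is applied to
    H-step lookahead values rather than to the evaluated values directly."""
    n_s, n_a, n_b = r.shape
    pi1 = np.full((n_s, n_a), 1.0 / n_a)                  # maximizer, mixed
    pi2 = np.full((n_s, n_b), 1.0 / n_b)                  # minimizer, mixed
    for _ in range(iters):
        V = evaluate_joint_policy(pi1, pi2, r, P, alpha)  # policy evaluation
        for _ in range(H):                                # lookahead
            V = shapley_operator(V, r, P, alpha)
        for s in range(n_s):                              # greedy improvement
            Q = r[s] + alpha * P[s] @ V
            _, pi1[s] = solve_matrix_game(Q)              # row player's maximin
            _, pi2[s] = solve_matrix_game(-Q.T)           # column player's minimax
    return pi1, pi2
```

With H set so that only the evaluated values are used, the loop reduces to plain naive policy iteration; the modification described above is precisely the extra applications of the lookahead operator before the greedy step.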