Many model-based reinforcement learning (RL) algorithms can be viewed as having two phases that are iteratively implemented: a learning phase, in which the model is approximately learned, and a planning phase, in which the learned model is used to derive a policy. In the case of standard MDPs, the planning problem can be solved using either value iteration or policy iteration. However, in the case of zero-sum Markov games, there is no efficient policy iteration algorithm; for example, it was shown in Hansen et al. (2013) that one has to solve Ω(1/(1-α)) MDPs, where α is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is provably convergent only under very restrictive assumptions. Prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges, and converges exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead in the policy improvement phase. This is appealing because lookahead is often used anyway in RL for games. We further show that lookahead can be implemented efficiently in linear Markov games, which are the counterpart of linear MDPs and have recently been the subject of much attention. We then consider multi-agent reinforcement learning that uses our algorithm in the planning phase, and provide sample and time complexity bounds for such an algorithm.
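Since the proposed modification is small, a sketch may help fix ideas. The following is a minimal, illustrative Python implementation of naive policy iteration with H-step lookahead for a small tabular zero-sum Markov game. The array layout r[s, a, b] and P[s, a, b, s'], the LP-based matrix-game solver, the exact placement of the lookahead steps, and the default values of H and iters are our own assumptions for illustration, not the paper's specification.

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Value and a maximizing mixed strategy of the zero-sum matrix game
    Q[a, b] (the row player maximizes), computed with a small linear program."""
    n_a, n_b = Q.shape
    # Decision variables z = [v, x_1, ..., x_{n_a}]; maximize v <=> minimize -v.
    c = np.zeros(n_a + 1)
    c[0] = -1.0
    # For every column b: v - sum_a x_a Q[a, b] <= 0.
    A_ub = np.hstack([np.ones((n_b, 1)), -Q.T])
    b_ub = np.zeros(n_b)
    # The mixed strategy must sum to one.
    A_eq = np.hstack([np.zeros((1, 1)), np.ones((1, n_a))])
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * n_a
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]

def shapley_operator(V, r, P, alpha):
    """One application of the Shapley (minimax Bellman) operator to V."""
    S = r.shape[0]
    TV = np.empty(S)
    for s in range(S):
        Q = r[s] + alpha * P[s] @ V      # state-s matrix game payoffs Q[a, b]
        TV[s], _ = matrix_game_value(Q)
    return TV

def evaluate_joint_policy(mu, nu, r, P, alpha):
    """Exact value of the joint policy (mu, nu) via a linear system."""
    S = r.shape[0]
    r_pi = np.einsum('sa,sb,sab->s', mu, nu, r)
    P_pi = np.einsum('sa,sb,sabt->st', mu, nu, P)
    return np.linalg.solve(np.eye(S) - alpha * P_pi, r_pi)

def naive_pi_with_lookahead(r, P, alpha, H=3, iters=50):
    """Naive policy iteration whose improvement step uses H-step lookahead.
    r[s, a, b]: reward to the max player; P[s, a, b, s']: transition kernel."""
    S, A, B = r.shape
    mu = np.full((S, A), 1.0 / A)   # max player's (mixed) policy
    nu = np.full((S, B), 1.0 / B)   # min player's (mixed) policy
    for _ in range(iters):
        # Evaluation: value of the current joint policy.
        V = evaluate_joint_policy(mu, nu, r, P, alpha)
        # Lookahead: push V through the Shapley operator H - 1 times...
        W = V
        for _ in range(H - 1):
            W = shapley_operator(W, r, P, alpha)
        # ...then both players switch to equilibrium strategies of the
        # resulting one-step matrix games.
        for s in range(S):
            Q = r[s] + alpha * P[s] @ W
            _, mu[s] = matrix_game_value(Q)      # max player's side
            _, nu[s] = matrix_game_value(-Q.T)   # min player's side
    return mu, nu, evaluate_joint_policy(mu, nu, r, P, alpha)

In a model-based setting, r and P would be replaced by their learned estimates, and the returned policies correspond to the output of the planning phase described above.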