We examine global non-asymptotic convergence properties of policy gradient methods for multi-agent reinforcement learning (RL) problems in Markov potential games (MPGs). To learn a Nash equilibrium of an MPG in which the size of the state space and/or the number of players can be very large, we propose new independent policy gradient algorithms that are run by all players in tandem. When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an $\epsilon$-Nash equilibrium with $O(1/\epsilon^2)$ iteration complexity, which does not explicitly depend on the state space size. When the exact gradient is not available, we establish an $O(1/\epsilon^5)$ sample complexity bound in a potentially infinitely large state space for a sample-based algorithm that utilizes function approximation. Moreover, we identify a class of independent policy gradient algorithms that enjoys convergence for both zero-sum Markov games and Markov cooperative games with players that are oblivious to the type of game being played. Finally, we provide computational experiments to corroborate the merits and effectiveness of our theoretical developments.
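To make the independent policy gradient idea concrete, the following is a minimal sketch of simultaneous projected gradient ascent in a single-state identical-interest game, the simplest special case of an MPG. The payoff matrix `R`, the step size `eta`, and the helper `project_simplex` are illustrative assumptions and not part of the paper's algorithmic specification, which covers general Markov potential games with large state spaces, sample-based gradients, and function approximation.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)


# Shared (identical-interest) payoff matrix: a degenerate, single-state MPG
# whose potential function coincides with the common return. (Illustrative.)
R = np.array([[1.0, 0.0, 0.2],
              [0.0, 1.0, 0.2],
              [0.2, 0.2, 0.5]])

n, m = R.shape
x1 = np.ones(n) / n   # player 1's mixed strategy (direct parameterization)
x2 = np.ones(m) / m   # player 2's mixed strategy
eta = 0.1             # assumed constant step size

for t in range(500):
    # Exact-gradient case: each player differentiates the shared return
    # J(x1, x2) = x1^T R x2 with respect to its own strategy only.
    g1 = R @ x2
    g2 = R.T @ x1
    # Independent, simultaneous projected gradient ascent; no coordination.
    x1 = project_simplex(x1 + eta * g1)
    x2 = project_simplex(x2 + eta * g2)

print("player 1 strategy:", np.round(x1, 3))
print("player 2 strategy:", np.round(x2, 3))
print("common return:", float(x1 @ R @ x2))
```

Under these assumptions the iterates converge to a pure Nash equilibrium of the identical-interest game; the paper's analysis extends this independent-update scheme to Markov potential games with stochastic gradients.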