We study the performance of policy gradient methods for the subclass of Markov games known as Markov potential games (MPGs), which extends the notion of potential games from the normal-form to the stateful setting and includes, as an important special case, the fully cooperative setting in which all agents share an identical reward function. Our focus in this paper is the convergence of the policy gradient method for solving MPGs under softmax policy parameterization, both in the tabular case and when parameterized with general function approximators such as neural networks. We first show the asymptotic convergence of this method to a Nash equilibrium of MPGs for tabular softmax policies. Second, we derive the finite-time performance of policy gradient in two settings: 1) using log-barrier regularization, and 2) using the natural policy gradient under best-response dynamics (NPG-BR). Finally, extending the notions of price of anarchy (POA) and smoothness from normal-form games, we introduce the POA for MPGs and provide a POA bound for NPG-BR. To our knowledge, this is the first POA bound for solving MPGs. To support our theoretical results, we empirically compare the convergence rates and POA of policy gradient variants for both tabular and neural softmax policies.
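For reference, a minimal sketch of the two standard objects the abstract relies on, written in our own (assumed) notation that may differ from the paper's: a Markov game is an MPG when a single potential function $\Phi$ tracks every agent's unilateral change in value, and the tabular softmax parameterization assigns one logit per agent, state, and action:
\[
V_i^{\pi_i,\pi_{-i}}(s) - V_i^{\pi_i',\pi_{-i}}(s)
  \;=\; \Phi^{\pi_i,\pi_{-i}}(s) - \Phi^{\pi_i',\pi_{-i}}(s)
  \quad \text{for all } s,\ i,\ \pi_i,\ \pi_i',\ \pi_{-i},
\qquad
\pi_{\theta_i}(a \mid s) \;=\; \frac{\exp(\theta_{i,s,a})}{\sum_{a' \in \mathcal{A}_i} \exp(\theta_{i,s,a'})}.
\]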