Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., the convergence of the vector that parameterizes the policy, even when the costs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which becomes especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters, instead of the high-dimensional policy. We then propose variants of the NPG algorithm, for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings.
翻译:多智能体相互作用在强化学习中变得越来越重要,策略梯度方法的理论基础吸引了越来越多的研究者的关注。我们研究了多智能体学习中自然策略梯度(NPG)算法的全局收敛性,证明了普通的NPG算法可能没有参数收敛性,即,即使成本被规范化(这在文献中启用了强策略空间收敛性保证),参数也可能没有收敛。这种参数的不收敛导致了学习的稳定性问题。这在函数逼近的情况下尤其重要。在此情况下,我们只能操作低维度的参数,而不是高维度的策略。我们对多种标准的多智能体学习场景提出了NPG算法的变种:两个玩家的博弈矩阵和Markov游戏以及多人单调博弈。我们也将结果推广到某些函数逼近的情况下。值得注意的是,在我们的算法中,代理人扮演了对称的角色。我们的结果对于解决某些具有结构的非凸性 - 非凹性最小化极小式优化问题也具有独立的利益。我们还提供了模拟结果以印证我们的理论发现。