Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., convergence of the vector that parameterizes the policy, even when the costs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which become especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters rather than the high-dimensional policy. We then propose variants of the NPG algorithm for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results may also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings.
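To make the parameter (non-)convergence phenomenon concrete, below is a minimal numerical sketch in Python; it is not the paper's algorithm or its experiments, but an illustration of the setting. It runs simultaneous entropy-regularized NPG updates under a tabular softmax parameterization on matching pennies; the payoff matrix, initial logits, and the step-size/regularization choices (eta, tau) are illustrative assumptions. Under softmax parameterization the NPG step reduces to a multiplicative-weights-style update on the logits, so one typically observes the policies approaching the regularized equilibrium while the logits pick up a common drift and their norm keeps growing, mirroring the parameter non-convergence discussed above.

    import numpy as np

    def softmax(theta):
        z = theta - theta.max()          # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # Matching-pennies payoff: the row (min) player pays the column (max) player.
    A = np.array([[ 1.0, -1.0],
                  [-1.0,  1.0]])

    eta, tau, T = 0.1, 0.1, 5000         # step size, regularization weight, iterations (assumed values)
    theta_x = np.array([ 0.5, -0.5])     # logits of the min player's softmax policy
    theta_y = np.array([-0.3,  0.3])     # logits of the max player's softmax policy

    for t in range(T):
        x, y = softmax(theta_x), softmax(theta_y)
        # Entropy-regularized cost vectors faced by each player at the current policies.
        q_x = A @ y + tau * np.log(x)
        q_y = -A.T @ x + tau * np.log(y)
        # Simultaneous NPG steps: under softmax parameterization the natural-gradient
        # direction is the (regularized) cost vector itself.
        theta_x = theta_x - eta * q_x
        theta_y = theta_y - eta * q_y

    print("policies       :", softmax(theta_x), softmax(theta_y))                    # approach the regularized equilibrium
    print("parameter norms:", np.linalg.norm(theta_x), np.linalg.norm(theta_y))      # keep drifting upward

In this sketch the policy-space iterates contract toward the (uniform) regularized equilibrium, but the log-sum-exp term in tau * log(softmax(theta)) injects a common shift into the logits at every step, so the parameter vector itself does not settle; this is one simple way the gap between policy convergence and parameter convergence can arise.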