Multi-Agent Reinforcement Learning (MARL) -- where multiple agents learn to interact in a shared dynamic environment -- arises in a wide range of critical applications. While there has been substantial progress on understanding the global convergence of policy optimization methods in single-agent RL, the design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges, which, unfortunately, remain largely unaddressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium-finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities, and lead to a better understanding of policy optimization in competitive Markov games.
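To give some intuition for the entropy-regularized OMWU building block mentioned above, the following is a minimal, assumption-laden sketch for the simpler two-player zero-sum matrix-game case. The payoff matrix `A`, step size `eta`, regularization strength `tau`, and the specific symmetric prediction/correction form are illustrative choices, not taken verbatim from the paper, whose actual algorithm operates on Markov games and additionally maintains value estimates updated on a slower timescale.

```python
import numpy as np

def entropy_reg_omwu(A, tau=0.1, eta=0.05, num_iters=5000):
    """Illustrative sketch (not the paper's exact algorithm) of an
    entropy-regularized optimistic multiplicative weights update for a
    two-player zero-sum matrix game: the row player maximizes x^T A y,
    the column player minimizes it, and both update symmetrically."""
    m, n = A.shape
    x = np.full(m, 1.0 / m)            # row player's policy
    y = np.full(n, 1.0 / n)            # column player's policy
    x_bar, y_bar = x.copy(), y.copy()  # optimistic ("midpoint") iterates

    def normalize(v):
        return v / v.sum()

    for _ in range(num_iters):
        # Prediction step: each player responds to the opponent's previous
        # midpoint policy; the exponent (1 - eta * tau) implements the
        # entropy regularization as a shrinkage toward the uniform policy.
        x_bar_new = normalize(x ** (1 - eta * tau) * np.exp(eta * (A @ y_bar)))
        y_bar_new = normalize(y ** (1 - eta * tau) * np.exp(-eta * (A.T @ x_bar)))
        x_bar, y_bar = x_bar_new, y_bar_new

        # Correction step: incorporate the freshly predicted opponent policy.
        x = normalize(x ** (1 - eta * tau) * np.exp(eta * (A @ y_bar)))
        y = normalize(y ** (1 - eta * tau) * np.exp(-eta * (A.T @ x_bar)))

    return x, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 5))
    tau = 0.1
    x, y = entropy_reg_omwu(A, tau=tau)
    # At the quantal response equilibrium of the tau-regularized game,
    # x is proportional to exp((A y) / tau); check the row player's residual.
    qre_x = np.exp(A @ y / tau)
    qre_x /= qre_x.sum()
    print("row-player QRE residual:", np.max(np.abs(x - qre_x)))
```

In this sketch, driving `tau` toward zero makes the quantal response equilibrium approach a Nash equilibrium of the unregularized game, which mirrors how the abstract describes trading off the amount of regularization against the convergence target.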