This paper studies policy optimization algorithms for multi-agent reinforcement learning. We begin by proposing an algorithm framework for two-player zero-sum Markov Games in the full-information setting, where each iteration consists of a policy update step at each state using a certain matrix game algorithm, and a value update step with a certain learning rate. This framework unifies many existing and new policy optimization algorithms. We show that the state-wise average policy of this algorithm converges to an approximate Nash equilibrium (NE) of the game, as long as the matrix game algorithms achieve low weighted regret at each state, with respect to weights determined by the speed of the value updates. Next, we show that this framework instantiated with the Optimistic Follow-The-Regularized-Leader (OFTRL) algorithm at each state (and smooth value updates) can find an $\widetilde{\mathcal{O}}(T^{-5/6})$ approximate NE in $T$ iterations, which improves over the current best $\widetilde{\mathcal{O}}(T^{-1/2})$ rate of symmetric policy optimization type algorithms. We also extend this algorithm to multi-player general-sum Markov Games and show an $\widetilde{\mathcal{O}}(T^{-3/4})$ convergence rate to Coarse Correlated Equilibria (CCE). Finally, we provide a numerical example to verify our theory and investigate the importance of smooth value updates, and find that using "eager" value updates instead (equivalent to the independent natural policy gradient algorithm) may significantly slow down the convergence, even on a simple game with $H=2$ layers.
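To make the framework described above concrete, the following Python sketch instantiates the per-state policy update with optimistic Hedge (an entropy-regularized OFTRL) and uses a smooth value update with learning rate $\alpha_t = (H+1)/(H+t)$. The toy game, the step size `eta`, the exact update ordering, and the uniform policy averaging are illustrative assumptions, not the paper's exact pseudocode.

```python
# Minimal sketch of the unified framework on a toy two-player zero-sum Markov game,
# assuming an optimistic-Hedge instantiation of OFTRL and a smooth value update.
import numpy as np

rng = np.random.default_rng(0)
H, S, A, B, T = 2, 3, 2, 2, 2000      # horizon, states, action counts, iterations
eta = 1.0 / np.sqrt(T)                # OFTRL step size (a generic, assumed choice)

# Toy game: stage rewards in [0, 1]; the row player maximizes, the column player minimizes.
R = rng.uniform(size=(H, S, A, B))                    # stage reward matrices
P = rng.dirichlet(np.ones(S), size=(H, S, A, B))      # P[h, s, a, b] -> next-state distribution

def optimistic_hedge(cum_loss, last_loss, eta):
    """OFTRL with entropy regularizer (optimistic Hedge): play
    softmax(-eta * (cumulative loss + prediction)), predicting the last observed loss."""
    z = -eta * (cum_loss + last_loss)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

V = np.zeros((H + 1, S))                                  # V[H] = 0 (terminal values)
Gx = np.zeros((H, S, A)); gx_last = np.zeros((H, S, A))   # row player: cumulative / last losses
Gy = np.zeros((H, S, B)); gy_last = np.zeros((H, S, B))   # column player: cumulative / last losses
x_bar = np.zeros((H, S, A)); y_bar = np.zeros((H, S, B))  # averaged output policies

for t in range(1, T + 1):
    alpha = (H + 1) / (H + t)         # smooth value-update learning rate (assumed form)
    V_prev = V.copy()                 # use last iteration's values in the stage games
    for h in range(H):
        for s in range(S):
            # Stage "Q matrix" at (h, s): immediate reward plus expected next-step value.
            Q = R[h, s] + P[h, s] @ V_prev[h + 1]
            # Policy update step: one OFTRL step per player on this matrix game.
            x = optimistic_hedge(Gx[h, s], gx_last[h, s], eta)   # row player maximizes
            y = optimistic_hedge(Gy[h, s], gy_last[h, s], eta)   # column player minimizes
            gx, gy = -Q @ y, Q.T @ x                             # realized loss vectors
            Gx[h, s] += gx; gx_last[h, s] = gx
            Gy[h, s] += gy; gy_last[h, s] = gy
            # Smooth ("incremental") value update, as opposed to an eager full replacement.
            V[h, s] = (1 - alpha) * V[h, s] + alpha * (x @ Q @ y)
            # Uniform averaging for illustration; the paper certifies a weighted state-wise average.
            x_bar[h, s] += x / T; y_bar[h, s] += y / T
```

Swapping the smooth update `V[h, s] = (1 - alpha) * V[h, s] + alpha * (x @ Q @ y)` for the eager replacement `V[h, s] = x @ Q @ y` recovers the independent-natural-policy-gradient-style variant that the numerical example shows can converge much more slowly.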