Multi-agent reinforcement learning has made substantial empirical progress in solving games with a large number of players. However, theoretically, the best known sample complexity for finding a Nash equilibrium in general-sum games scales exponentially with the number of players due to the size of the joint action space, and there is a matching exponential lower bound. This paper investigates which learning goals admit better sample complexities in the setting of $m$-player general-sum Markov games with $H$ steps, $S$ states, and $A_i$ actions per player. First, we design algorithms for learning an $\epsilon$-Coarse Correlated Equilibrium (CCE) in $\widetilde{\mathcal{O}}(H^5S\max_{i\le m} A_i / \epsilon^2)$ episodes, and an $\epsilon$-Correlated Equilibrium (CE) in $\widetilde{\mathcal{O}}(H^6S\max_{i\le m} A_i^2 / \epsilon^2)$ episodes. This is the first line of results for learning CCE and CE with sample complexities polynomial in $\max_{i\le m} A_i$. Our algorithm for learning CE integrates an adversarial bandit subroutine that minimizes a weighted swap regret, along with several novel designs in the outer loop. Second, we consider the important special case of Markov Potential Games, and design an algorithm that learns an $\epsilon$-approximate Nash equilibrium within $\widetilde{\mathcal{O}}(S\sum_{i\le m} A_i / \epsilon^3)$ episodes (when only highlighting the dependence on $S$, $A_i$, and $\epsilon$), which depends only linearly on $\sum_{i\le m} A_i$ and significantly improves over the best known algorithm in its $\epsilon$ dependence. Overall, our results shed light on what equilibria or structural assumptions on the game may enable sample-efficient learning with many players.
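As a clarifying sketch (these are the standard notions of approximate equilibria in Markov games, stated here for orientation rather than quoted from the abstract, and the paper's formal definitions may differ in normalization), a correlated policy $\pi$ is an $\epsilon$-CCE if no player can gain more than $\epsilon$ by unilaterally deviating to its own best response, and an $\epsilon$-CE if no player can gain more than $\epsilon$ via any strategy modification:
\begin{align*}
\text{$\epsilon$-CCE:}\quad & \max_{i\le m}\Big(\max_{\mu_i} V_i^{\mu_i \times \pi_{-i}}(s_1) - V_i^{\pi}(s_1)\Big) \le \epsilon, \\
\text{$\epsilon$-CE:}\quad & \max_{i\le m}\Big(\max_{\phi_i} V_i^{\phi_i \diamond \pi}(s_1) - V_i^{\pi}(s_1)\Big) \le \epsilon,
\end{align*}
where $V_i^{\pi}(s_1)$ is player $i$'s value under the correlated policy $\pi$ from the initial state, $\mu_i \times \pi_{-i}$ denotes player $i$ deviating to policy $\mu_i$ while the others follow $\pi$, and $\phi_i \diamond \pi$ denotes the policy obtained when player $i$ applies the strategy modification $\phi_i$ to $\pi$. Every CE is a CCE, and every Nash equilibrium is a CE.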