Computing Nash equilibrium policies is a central problem in multi-agent reinforcement learning that has received extensive attention both in theory and in practice. However, provable guarantees have thus far either been limited to fully competitive or fully cooperative scenarios, or have relied on strong assumptions that are difficult to meet in most practical applications. In this work, we depart from those prior results by investigating infinite-horizon \emph{adversarial team Markov games}, a natural and well-motivated class of games in which a team of identically-interested players -- in the absence of any explicit coordination or communication -- competes against an adversarial player. This setting allows for a unifying treatment of zero-sum Markov games and Markov potential games, and serves as a step toward modeling more realistic strategic interactions that feature both competing and cooperative interests. Our main contribution is the first algorithm for computing stationary $\epsilon$-approximate Nash equilibria in adversarial team Markov games with computational complexity that is polynomial in all the natural parameters of the game, as well as $1/\epsilon$. The proposed algorithm is particularly natural and practical: each player in the team performs independent policy gradient steps, in tandem with best responses from the adversary; in turn, the policy for the adversary is obtained by solving a carefully constructed linear program. Our analysis leverages non-standard techniques to establish the KKT optimality conditions for a nonlinear program with nonconvex constraints, thereby leading to a natural interpretation of the induced Lagrange multipliers. Along the way, we significantly extend an important characterization of optimal policies in adversarial (normal-form) team games due to von Stengel and Koller (GEB '97).
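To make the team-side update concrete, the following is a minimal sketch of independent projected gradient steps paired with an adversary best response. It is not the authors' implementation: it assumes a single-state (normal-form) game rather than an infinite-horizon Markov game, it omits the linear program that recovers the adversary's policy, and the names `project_simplex`, `ipg_with_best_response`, and `cost` are hypothetical.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)


def ipg_with_best_response(cost, steps=2000, lr=0.05):
    """Two team players minimize cost[a1, a2, b]; the adversary maximizes it.

    Assumption: a one-state game, so each policy is one mixed strategy.
    """
    n1, n2, _ = cost.shape
    x1 = np.full(n1, 1.0 / n1)  # team player 1, uniform start
    x2 = np.full(n2, 1.0 / n2)  # team player 2, uniform start
    for _ in range(steps):
        # Adversary best-responds with a pure action to the current team profile.
        b = int(np.argmax(np.einsum("i,j,ijb->b", x1, x2, cost)))
        # Each team player differentiates the expected cost with respect to
        # its own strategy only, holding the teammate's strategy fixed.
        g1 = cost[:, :, b] @ x2
        g2 = cost[:, :, b].T @ x1
        # Simultaneous projected gradient steps, with no coordination.
        x1 = project_simplex(x1 - lr * g1)
        x2 = project_simplex(x2 - lr * g2)
    return x1, x2
```

Calling `ipg_with_best_response(cost)` with a payoff tensor of shape `(n1, n2, m)` returns the two team strategies after the final step. The step size and iteration count here are illustrative defaults, not the values required by the paper's guarantees.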