Multi-agent reinforcement learning (MARL) has become effective in tackling discrete cooperative game scenarios. However, MARL has yet to penetrate settings beyond those modelled by team and zero-sum games, confining it to a small subset of multi-agent systems. In this paper, we introduce a new generation of MARL learners that can handle nonzero-sum payoff structures and continuous settings. In particular, we study the MARL problem in a class of games known as stochastic potential games (SPGs) with continuous state-action spaces. Unlike cooperative games, in which all agents share a common reward, SPGs are capable of modelling real-world scenarios in which agents seek to fulfil their individual goals. We prove theoretically that our learning method, SPot-AC, enables independent agents to learn Nash equilibrium strategies in polynomial time. We demonstrate that our framework tackles previously unsolvable tasks such as Coordination Navigation and large selfish routing games, and that it outperforms state-of-the-art MARL baselines such as MADDPG and COMIX in such scenarios.
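For concreteness, the defining property that SPGs inherit from (static) potential games can be sketched as follows; this is the standard textbook condition with generic notation ($u_i$, $\phi$, $a_i$, $a_{-i}$), not necessarily the paper's exact SPG formulation.

% Sketch of the standard (exact) potential game condition, assuming generic
% payoff functions u_i and a potential function \phi; notation is illustrative.
\[
    u_i(a_i', a_{-i}) - u_i(a_i, a_{-i}) \;=\; \phi(a_i', a_{-i}) - \phi(a_i, a_{-i})
    \quad \text{for every agent } i \text{ and every unilateral deviation } a_i' .
\]
% Under this condition, every maximiser of \phi is a Nash equilibrium, which is
% the structural property that lets self-interested agents with individual
% rewards be analysed through a single shared function.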