Multi-agent reinforcement learning (MARL) algorithms often suffer from an exponential dependence of the sample complexity on the number of agents, a phenomenon known as \emph{the curse of multiagents}. In this paper, we address this challenge by investigating sample-efficient model-free algorithms in \emph{decentralized} MARL, and aim to improve existing algorithms along this line. For learning (coarse) correlated equilibria in general-sum Markov games, we propose \emph{stage-based} V-learning algorithms that significantly simplify the algorithmic design and analysis of recent works, and circumvent a rather complicated no-\emph{weighted}-regret bandit subroutine. For learning Nash equilibria in Markov potential games, we propose an independent policy gradient algorithm with a decentralized momentum-based variance reduction technique. All our algorithms are decentralized in that each agent makes decisions based only on its local information. Neither communication nor centralized coordination is required during learning, leading to a natural generalization to a large number of agents. We also provide numerical simulations to corroborate our theoretical findings.