We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results in normal-form games. While recent work has shown that such algorithms exist for restricted settings (notably, when regret is defined with respect to deviations to Markovian policies), the question of whether independent no-regret learning can be achieved in the standard Markov game framework was open. We provide a decisive negative resolution to this problem, both from a computational and statistical perspective. We show that:

- Under the widely-believed assumption that PPAD-hard problems cannot be solved in polynomial time, there is no polynomial-time algorithm that attains no-regret in general-sum Markov games when executed independently by all players, even when the game is known to the algorithm designer and the number of players is a small constant.

- When the game is unknown, no algorithm, regardless of computational efficiency, can achieve no-regret without observing a number of episodes that is exponential in the number of players.

Perhaps surprisingly, our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm. They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute a coarse correlated equilibrium that is sparse in the sense that it can be represented as a mixture of a small number of product policies. The crux of our approach is a novel application of aggregation techniques from online learning, whereby we show that any algorithm for the SparseCCE problem can be used to compute approximate Nash equilibria for non-zero-sum normal-form games.
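For concreteness, the sparsity notion above can be sketched informally as follows; the uniform-mixture form and the notation $V_j$ (player $j$'s value) and $\pi^{(i)}_j$ (player $j$'s policy in the $i$-th product policy) are illustrative simplifications rather than the exact formalization used in the body of the paper. A $k$-sparse $\epsilon$-approximate CCE of an $m$-player game is a mixture of $k$ product policies from which no player can gain more than $\epsilon$ by unilaterally deviating:
\[
  \bar{\sigma} \;=\; \frac{1}{k}\sum_{i=1}^{k} \pi^{(i)}_1 \times \cdots \times \pi^{(i)}_m,
  \qquad
  \max_{\pi'_j} V_j\bigl(\pi'_j, \bar{\sigma}_{-j}\bigr) \;-\; V_j(\bar{\sigma}) \;\le\; \epsilon
  \quad \text{for all } j \in [m].
\]
Under this reading, an independent no-regret learning procedure run for $k$ episodes induces (via its empirical play) exactly such a $k$-sparse approximate CCE, which is why lower bounds for SparseCCE translate into the regret and sample-complexity lower bounds stated above.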