We propose a new model, independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions marginalized by the other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that scale only polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works on Markov games with function approximation have sample complexity bounds that scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle the non-stationarity incurred by multiple agents and by the use of function approximation; (2) separating the learning of Markov equilibria from exploration in Markov games, which allows us to use a full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response-type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result of $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. (2022), where $\epsilon$ is the desired accuracy, and which also significantly improves the dependence on other problem parameters.
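To make the "independent linear function approximation" condition concrete, the following is a minimal sketch of what such an assumption could look like, using illustrative notation (the feature map $\phi_i$, weight $w_h^{i,\pi}$, and dimension $d_i$ are not necessarily the paper's exact symbols): each agent $i$'s Q-function, marginalized over the other agents' policy $\pi_{-i}$, is assumed linear in an agent-specific feature map,
\[
    Q_h^{i,\pi}(s, a_i)
    \;=\; \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot \mid s)}
          \bigl[ Q_h^{\pi}(s, a_i, a_{-i}) \bigr]
    \;=\; \phi_i(s, a_i)^{\top} w_h^{i,\pi},
    \qquad \phi_i : \mathcal{S} \times \mathcal{A}_i \to \mathbb{R}^{d_i},
\]
so that, under this reading, each agent's sample complexity can scale with its own dimension $d_i$ rather than with the size of the joint action space $\prod_j |\mathcal{A}_j|$.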