V-Learning -- A Simple, Efficient, Decentralized Algorithm for Multiagent RL

From arXiv: This is the journal version of arXiv:2006.12007, with new results on (1) finding CE and CCE in the multiplayer general-sum setting, (2) monotonic techniques that allow V-learning to output Markov policies in a subset of settings, and (3) decoupling V-learning from the adversarial bandit subroutine.

A major challenge of multiagent reinforcement learning (MARL) is the curse of multiagents, where the size of the joint action space scales exponentially with the number of agents. This remains a bottleneck for designing efficient MARL algorithms even in a basic scenario with finitely many states and actions. This paper resolves this challenge for the model of episodic Markov games. We design a new class of fully decentralized algorithms -- V-learning, which provably learns Nash equilibria (in the two-player zero-sum setting), correlated equilibria and coarse correlated equilibria (in the multiplayer general-sum setting) in a number of samples that scales only with $\max_{i\in[m]} A_i$, where $A_i$ is the number of actions for the $i^{\rm th}$ player. This is in sharp contrast to the size of the joint action space, which is $\prod_{i=1}^m A_i$. V-learning (in its basic form) is a new class of single-agent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm. Like the classical Q-learning algorithm, it performs incremental updates to the value functions. Unlike Q-learning, it maintains only estimates of V-values instead of Q-values. This key difference allows V-learning to achieve the claimed guarantees in the MARL setting by simply letting all agents run V-learning independently.
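The structure described above (one adversarial bandit per step and state, an incremental update to a V-value estimate, and agents that learn independently from their own actions and rewards) can be illustrated with a minimal Python sketch. Everything beyond that structure is an assumption made for illustration and is not stated in the abstract: the class and function names (`Exp3Bandit`, `VLearner`, `run_independent`), the Exp3-style bandit used as a stand-in subroutine, the step size `(H + 1) / (H + t)`, the bonus of the form `bonus_scale * sqrt(H^3 / t)`, rewards in `[0, 1]`, and the one-state test game. It is not the tuned algorithm analyzed in the paper.

```python
import math
import random


class Exp3Bandit:
    """Exp3-style adversarial bandit (an illustrative stand-in subroutine)."""

    def __init__(self, num_actions, lr=0.1):
        self.lr = lr
        self.log_weights = [0.0] * num_actions

    def policy(self):
        # Softmax over log-weights, shifted by the max for numerical stability.
        m = max(self.log_weights)
        exps = [math.exp(w - m) for w in self.log_weights]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, rng):
        probs = self.policy()
        return rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, action, loss):
        # Importance-weighted loss estimate for the played action only.
        p = self.policy()[action]
        self.log_weights[action] -= self.lr * loss / p


class VLearner:
    """One agent: keeps V-values only (no Q-table) plus a bandit per (step, state)."""

    def __init__(self, num_states, num_actions, horizon, bonus_scale=0.1, seed=0):
        self.H = horizon
        self.bonus_scale = bonus_scale  # assumed constant, not from the source
        self.rng = random.Random(seed)
        # Optimistic initialization: V[h][s] = H - h for steps h = 0..H-1, V[H][s] = 0.
        self.V = [[float(horizon - h)] * num_states for h in range(horizon)]
        self.V.append([0.0] * num_states)
        self.V_tilde = [row[:] for row in self.V]
        self.counts = [[0] * num_states for _ in range(horizon)]
        self.bandits = [[Exp3Bandit(num_actions) for _ in range(num_states)]
                        for _ in range(horizon)]

    def act(self, h, s):
        return self.bandits[h][s].sample(self.rng)

    def observe(self, h, s, a, r, s_next):
        H = self.H
        self.counts[h][s] += 1
        t = self.counts[h][s]
        alpha = (H + 1) / (H + t)                          # assumed step size
        bonus = self.bonus_scale * math.sqrt(H ** 3 / t)   # assumed bonus form
        target = r + self.V[h + 1][s_next] + bonus
        # Incremental update of the V-value estimate, then truncation.
        self.V_tilde[h][s] = (1 - alpha) * self.V_tilde[h][s] + alpha * target
        self.V[h][s] = min(float(H - h), self.V_tilde[h][s])
        # Feed a loss in [0, 1] to the bandit attached to this (step, state).
        loss = (H - r - self.V[h + 1][s_next]) / H
        self.bandits[h][s].update(a, loss)


def run_independent(learners, reward_fn, episodes):
    """Hypothetical one-state game: each agent sees only its own action and reward."""
    for _ in range(episodes):
        for h in range(learners[0].H):
            actions = [agent.act(h, 0) for agent in learners]
            rewards = reward_fn(actions)
            for agent, a, r in zip(learners, actions, rewards):
                agent.observe(h, 0, a, r, 0)
```

Each learner in this sketch stores one bandit over its own $A_i$ actions per step and state, so its memory never depends on the joint action space of size $\prod_{i=1}^m A_i$. A call such as `run_independent([VLearner(1, 2, 1, seed=i) for i in range(2)], reward_fn, 500)`, with `reward_fn` mapping the list of joint actions to a list of per-agent rewards, runs two agents side by side without any communication between them.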
