This paper addresses the problem of efficiently learning an equilibrium in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of computing a Nash equilibrium (NE), we instead aim to find a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing correlations among the agents' strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to explore the unknown environment efficiently, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an $\epsilon$-approximate CCE in at most $\widetilde{O}(H^6 S A/\epsilon^2)$ episodes, where $S$ is the number of states, $A$ is the size of the largest individual action space, and $H$ is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel analysis of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which may be of independent interest. One key feature of our algorithm is that it is fully \emph{decentralized}, in the sense that each agent has access only to its local information and is completely oblivious to the presence of the others. As a result, our algorithm readily scales to an arbitrary number of agents without suffering an exponential dependence on the number of agents.
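To make the policy-update subroutine more concrete, the following is a minimal sketch of a single OMD step with a negative-entropy regularizer under bandit feedback, which reduces to a multiplicative-weights update. It is not the stabilized variant analyzed in the paper: the function name `omd_entropy_step`, the importance-weighted loss estimator, and the decaying learning-rate schedule in the usage loop are all illustrative assumptions.

```python
import numpy as np


def omd_entropy_step(policy, action, loss, lr):
    """One bandit-feedback OMD step with the negative-entropy regularizer.

    policy: current distribution over actions (1-D float array summing to 1)
    action: index of the action that was actually played
    loss:   observed loss in [0, 1] for that action
    lr:     learning rate for this step (allowed to change over time)
    """
    # Importance-weighted loss estimate (an assumption of this sketch):
    # only the played action receives a nonzero entry.
    loss_hat = np.zeros_like(policy)
    loss_hat[action] = loss / policy[action]
    # With the entropy regularizer, mirror descent becomes a
    # multiplicative-weights update, done here in log space.
    logits = np.log(policy) - lr * loss_hat
    logits -= logits.max()  # shift for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()


if __name__ == "__main__":
    # Hypothetical usage: one agent, three actions, synthetic losses.
    rng = np.random.default_rng(0)
    num_actions = 3
    mean_loss = np.array([0.9, 0.5, 0.1])  # made-up values for illustration
    policy = np.full(num_actions, 1.0 / num_actions)
    for t in range(1, 501):
        action = rng.choice(num_actions, p=policy)
        loss = float(rng.random() < mean_loss[action])
        # Illustrative decaying learning rate, not the paper's schedule.
        lr = np.sqrt(np.log(num_actions) / (num_actions * t))
        policy = omd_entropy_step(policy, action, loss, lr)
    print(policy)
```

In a decentralized setting, each agent would keep such a policy per state and step and update it using only its own action and observed feedback, which is why the per-agent cost depends on the size of its own action space rather than on the joint action space.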