Many real-world applications of multi-agent reinforcement learning (RL), such as multi-robot navigation and the decentralized control of cyber-physical systems, involve cooperation among agents acting as a team with aligned objectives. We study multi-agent RL in the most basic cooperative setting of Markov teams, a class of Markov games in which the cooperating agents share a common reward. We propose an algorithm in which each agent independently runs stage-based V-learning (a Q-learning-style algorithm) to efficiently explore the unknown environment, while using a stochastic gradient descent (SGD) subroutine for policy updates. We show that the agents can learn an $\epsilon$-approximate Nash equilibrium policy in at most $\widetilde{O}(1/\epsilon^4)$ episodes. Our results advocate a novel \emph{stage-based} V-learning approach that creates a stage-wise stationary environment. We also show that, under certain smoothness assumptions on the team, our algorithm can achieve a nearly \emph{team-optimal} Nash equilibrium. Simulation results corroborate our theoretical findings. A key feature of our algorithm is that it is \emph{decentralized}: each agent observes only the state and its own local actions, and is even \emph{oblivious} to the presence of the other agents. Neither communication among teammates nor coordination by a central controller is required during learning. Hence, our algorithm readily generalizes to an arbitrary number of agents without suffering from an exponential dependence on the number of agents.
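To make the algorithmic idea concrete, below is a minimal Python sketch of one agent's local learner, combining stage-based V-learning for exploration with an SGD-style exponentiated-gradient update of the local policy. The class name \texttt{StageBasedVLearner} and all hyperparameters (stage length, step size, value refresh rule) are illustrative assumptions, not the exact procedure analyzed in the paper.
\begin{verbatim}
import numpy as np

class StageBasedVLearner:
    """One agent's local learner: it sees only the state, its own action,
    and the team reward, and is oblivious to its teammates.

    Illustrative sketch: stage-based V-learning for exploration plus an
    SGD-style (exponentiated-gradient) update of the local policy. Stage
    length, step size, and the value refresh rule are assumptions here.
    """

    def __init__(self, n_states, n_actions, horizon, stage_len=100, lr=0.1):
        self.S, self.A, self.H = n_states, n_actions, horizon
        self.stage_len = stage_len          # episodes per stage (assumption)
        self.lr = lr                        # SGD step size (assumption)
        # Optimistic value estimates V_h(s); V_{H+1} = 0 by convention.
        self.V = np.zeros((horizon + 1, n_states))
        self.V[:horizon] = float(horizon)
        # Local policy pi_h(a | s), initialized uniformly.
        self.policy = np.full((horizon, n_states, n_actions), 1.0 / n_actions)
        self._reset_stage()

    def _reset_stage(self):
        self.count = np.zeros((self.H, self.S), dtype=int)
        self.v_sum = np.zeros((self.H, self.S))

    def act(self, h, s, rng):
        """Sample a local action from the current stage's policy."""
        return rng.choice(self.A, p=self.policy[h, s])

    def observe(self, h, s, a, r, s_next):
        """Record one transition and take one SGD-style policy step."""
        target = r + self.V[h + 1, s_next]  # one-step value target
        self.count[h, s] += 1
        self.v_sum[h, s] += target
        # Importance-weighted payoff estimate for the chosen local action,
        # followed by an exponentiated-gradient (mirror ascent) step.
        grad = np.zeros(self.A)
        grad[a] = target / max(self.policy[h, s, a], 1e-8)
        logits = np.log(self.policy[h, s] + 1e-12) + self.lr * grad / self.H
        w = np.exp(logits - logits.max())
        self.policy[h, s] = w / w.sum()

    def end_episode(self, episode):
        """At stage boundaries, refresh V from stage averages and reset.

        Freezing V within a stage is what keeps the environment faced by
        the other (oblivious) agents stage-wise stationary.
        """
        if (episode + 1) % self.stage_len == 0:
            visited = self.count > 0
            stage_avg = self.v_sum[visited] / self.count[visited]
            self.V[:self.H][visited] = np.minimum(self.H, stage_avg)
            self._reset_stage()
\end{verbatim}
In a decentralized run, each agent would instantiate its own \texttt{StageBasedVLearner} over its local action set, call \texttt{act} to choose its action, and feed only its own transitions to \texttt{observe} and \texttt{end\_episode}; no information about teammates is exchanged.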