Mean-Field Control (MFC) is a powerful tool for solving Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can approximate MARL well when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents interact uniformly with one another, which does not hold in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-field `seen' by each agent is different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$, where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of the state and action spaces, respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL problem with an error of $\mathcal{O}(\max\{e,\epsilon\})$ and a sample complexity of $\mathcal{O}(\epsilon^{-3})$ for any $\epsilon >0$.
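As an illustrative sketch of the non-uniform interaction described above (the notation $W$ and $\mu_i^N$ is assumed here for exposition, not taken from the paper): let $W \in [0,1]^{N \times N}$ be the doubly stochastic interaction matrix and $x_1,\dots,x_N \in \mathcal{X}$ the agents' states. The mean-field seen by agent $i$ can then be written as the weighted empirical distribution
$$
\mu_i^N(x) \;=\; \sum_{j=1}^{N} W_{ij}\,\mathbf{1}\{x_j = x\}, \qquad x \in \mathcal{X},
\qquad \text{with } \sum_{j} W_{ij} = \sum_{i} W_{ij} = 1,
$$
so each agent's weights form a probability distribution over the population. In the uniform case $W_{ij} = 1/N$, every agent sees the same empirical state distribution and the standard exchangeable MFC setting is recovered.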