Mean-Field Control based Approximation of Multi-Agent Reinforcement Learning in Presence of a Non-decomposable Shared Global State

Mean Field Control (MFC) is a powerful approximation tool for solving large-scale Multi-Agent Reinforcement Learning (MARL) problems. However, the success of MFC relies on the presumption that, given the local states and actions of all the agents, the next (local) states of the agents evolve conditionally independently of each other. Here we demonstrate that MFC can still be applied as a good approximation tool even in a MARL setting where, in addition to their conditionally independently evolving local states, the agents share a common global state, which introduces a correlation between the state transition processes of individual agents. The global state is assumed to be non-decomposable, i.e., it cannot be expressed as a collection of local states of the agents. We compute the approximation error as $\mathcal{O}(e)$ where $e=\frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|} +\sqrt{|\mathcal{U}|}\right]$. Here $N$ denotes the size of the agent population, and $|\mathcal{X}|$, $|\mathcal{U}|$ respectively denote the sizes of the (local) state and action spaces of individual agents. The approximation error is found to be independent of the size of the shared global state space. We further demonstrate that, in the special case where the reward and state transition functions are independent of the action distribution of the population, the error can be improved to $e=\frac{\sqrt{|\mathcal{X}|}}{\sqrt{N}}$. Finally, we devise a Natural Policy Gradient based algorithm that solves the MFC problem with $\mathcal{O}(\epsilon^{-3})$ sample complexity and obtains a policy that is within $\mathcal{O}(\max\{e,\epsilon\})$ error of the optimal MARL policy for any $\epsilon>0$.
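To give a feel for how the stated error term scales, the following is a minimal sketch that evaluates $e$ for given population and space sizes. The function name `mfc_error_scale` and its parameters are hypothetical and introduced only for illustration. The sketch computes only the scaling term $e$ from the two formulas above; the constants hidden inside $\mathcal{O}(\cdot)$ are not given in the abstract and are not modeled.

```python
import math


def mfc_error_scale(n_agents, n_states, n_actions, depends_on_action_dist=True):
    """Evaluate the scaling term e from the abstract (hypothetical helper).

    General case:  e = (sqrt(|X|) + sqrt(|U|)) / sqrt(N)
    Special case:  e = sqrt(|X|) / sqrt(N), when reward and transition
                   functions do not depend on the action distribution.
    The size of the shared global state space does not appear in e.
    """
    if n_agents <= 0 or n_states <= 0 or n_actions <= 0:
        raise ValueError("population and space sizes must be positive")
    numerator = math.sqrt(n_states)
    if depends_on_action_dist:
        numerator += math.sqrt(n_actions)
    return numerator / math.sqrt(n_agents)


# Illustrative values (assumptions, not taken from the paper):
# N = 100 agents, |X| = 4 local states, |U| = 9 actions.
general = mfc_error_scale(100, 4, 9)           # (2 + 3) / 10 = 0.5
special = mfc_error_scale(100, 4, 9, False)    # 2 / 10 = 0.2
print(general, special)
```

Because $e$ decays as $1/\sqrt{N}$, quadrupling the population halves the term, for example `mfc_error_scale(400, 4, 9)` evaluates to `0.25`.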
