Multi-agent reinforcement learning (MARL) becomes more challenging in the presence of more agents, as the capacity of the joint state and action spaces grows exponentially in the number of agents. To address such a challenge of scale, we identify a class of cooperative MARL problems with permutation invariance, and formulate it as a mean-field Markov decision process (MDP). To exploit the permutation invariance therein, we propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture. We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence. Moreover, its sample complexity is independent of the number of agents. We validate the theoretical advantages of MF-PPO with numerical experiments in the multi-agent particle environment (MPE). In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors with a smaller number of model parameters, which is the key to its generalization performance.
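To make the architectural idea concrete, below is a minimal sketch (not the authors' implementation) of a permutation-invariant actor-critic in the DeepSets style suggested by the abstract: each agent's observation is embedded by a shared network, the embeddings are mean-pooled into a "mean-field" summary, and every agent conditions on its own embedding plus that summary. All module names and layer sizes are illustrative assumptions.

```python
# Sketch of a permutation-invariant actor-critic (assumed architecture):
# the actor is permutation-equivariant over agents, the critic is invariant.
import torch
import torch.nn as nn


class MeanFieldEncoder(nn.Module):
    """Shared per-agent embedding followed by mean pooling over agents."""

    def __init__(self, obs_dim: int, hidden_dim: int):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, obs: torch.Tensor):
        # obs: (batch, n_agents, obs_dim)
        per_agent = self.phi(obs)          # (batch, n_agents, hidden)
        pooled = per_agent.mean(dim=1)     # (batch, hidden), order-independent
        return per_agent, pooled


class PermutationInvariantActorCritic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = MeanFieldEncoder(obs_dim, hidden_dim)
        self.actor_head = nn.Linear(2 * hidden_dim, act_dim)
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs: torch.Tensor):
        per_agent, pooled = self.encoder(obs)
        # Each agent sees its own embedding and the shared mean-field summary.
        joint = torch.cat(
            [per_agent, pooled.unsqueeze(1).expand_as(per_agent)], dim=-1
        )
        action_logits = self.actor_head(joint)        # (batch, n_agents, act_dim)
        value = self.critic_head(pooled).squeeze(-1)  # (batch,), invariant
        return action_logits, value


if __name__ == "__main__":
    model = PermutationInvariantActorCritic(obs_dim=8, act_dim=5)
    obs = torch.randn(4, 10, 8)                       # 4 batches, 10 agents
    logits, value = model(obs)
    perm = torch.randperm(10)
    logits_p, value_p = model(obs[:, perm, :])
    # The value is unchanged and the logits are permuted with the agents.
    assert torch.allclose(value, value_p, atol=1e-5)
    assert torch.allclose(logits[:, perm, :], logits_p, atol=1e-5)
```

Note that the parameter count of this sketch does not depend on the number of agents, which mirrors the abstract's claim that MF-PPO's sample complexity is independent of the number of agents.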