We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, IPPO and MAPPO, both of which rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises from enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be enforced effectively and in a principled way by bounding the independent ratios according to the number of agents in training, which provides a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and of tuning the hyperparameters with respect to the number of agents, as predicted by our theoretical analysis.
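To make the idea of agent-count-aware ratio clipping concrete, the following is a minimal, hypothetical sketch of a per-agent PPO-style clipped surrogate in which the clip range shrinks with the number of agents. The function name, its arguments, and the specific scaling rule eps_i = (1 + epsilon)^(1/n) - 1 are illustrative assumptions for exposition, not the construction used in the paper.

```python
import torch

def per_agent_clipped_surrogate(log_prob_new, log_prob_old, advantage,
                                n_agents, joint_epsilon=0.2):
    """Hypothetical per-agent clipped surrogate with an agent-count-aware clip range.

    log_prob_new / log_prob_old: (batch,) log-probabilities of one agent's action
        under its current and behaviour policies (an independent ratio).
    advantage: (batch,) centralized advantage estimate shared across agents.
    n_agents: number of agents in training; the per-agent clip range is tightened
        with it so the product of all agents' ratios stays close to 1.
    """
    # Independent ratio for this agent only (no joint ratio is formed).
    ratio = torch.exp(log_prob_new - log_prob_old)

    # Illustrative scaling (an assumption, not the paper's bound): if every
    # agent's ratio lies in [1 - eps_i, 1 + eps_i], the product of the n ratios
    # stays roughly within [1 - joint_epsilon, 1 + joint_epsilon].
    eps_i = (1.0 + joint_epsilon) ** (1.0 / n_agents) - 1.0

    # Standard PPO-style pessimistic clipping, applied per agent.
    clipped = torch.clamp(ratio, 1.0 - eps_i, 1.0 + eps_i)
    return torch.minimum(ratio * advantage, clipped * advantage).mean()
```

The design choice illustrated here is the one the abstract argues for: the clipping hyperparameter is not fixed in isolation but tuned with respect to the number of agents, so that enforcing a per-agent bound implies a trust region over the joint decentralized policy.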