Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs through its successful application to multi-agent cooperative tasks such as robot swarm control, autonomous vehicle coordination, and computer games. In this paper, we propose Noisy-MAPPO, which achieves win rates above 90\% in all StarCraft Multi-agent Challenge (SMAC) scenarios. First, we theoretically generalize Proximal Policy Optimization (PPO) to Multi-agent PPO (MAPPO) via the lower bound of Trust Region Policy Optimization (TRPO). However, we find that the shared advantage values in the resulting MAPPO objective may mislead the learning of agents whose actions are unrelated to those advantage values, a problem we call \textit{The Policies Overfitting in Multi-agent Cooperation (POMAC)}. To address it, we propose noisy advantage-value methods (Noisy-MAPPO and Advantage-Noisy-MAPPO). The experimental results show that our random-noise method improves the performance of vanilla MAPPO by 80\% in some Super-Hard SMAC scenarios. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
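The core idea of the noisy advantage-value methods is to perturb the shared advantage values before each agent's policy update, so that no agent overfits to advantage signals unrelated to its own actions. A minimal sketch of this perturbation step is shown below; the function name, the per-agent noise layout, and the Gaussian noise scale \texttt{sigma} are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def noisy_advantages(advantages, sigma=0.1, rng=None):
    """Perturb shared advantage values with zero-mean Gaussian noise.

    advantages: array of shape (batch, n_agents), the shared advantage
        values broadcast to every agent (layout is an assumption here).
    sigma: standard deviation of the noise (illustrative default).
    Returns a same-shape array; each agent sees an independently
    perturbed copy, which decorrelates its update from the shared signal.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=sigma, size=advantages.shape)
    return advantages + noise
```

In practice the perturbed values would replace the raw advantages inside each agent's PPO surrogate loss; the noise is resampled every update so its effect averages out in expectation.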