Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs through its successful application to multi-agent cooperative tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent works have applied Proximal Policy Optimization (PPO) to multi-agent tasks, yielding Independent PPO (IPPO) and vanilla Multi-Agent PPO (MAPPO), the latter of which uses a centralized value function. However, previous literature shows that MAPPO may not perform as well as IPPO and Fine-tuned QMIX. MAPPO-Feature-Pruned (MAPPO-FP) therefore improves the performance of MAPPO through carefully designed artificial features. In addition, no existing literature provides a theoretical analysis of the working mechanism of MAPPO. In this paper, we first theoretically generalize single-agent PPO to vanilla MAPPO, showing that vanilla MAPPO is approximately equivalent to optimizing a multi-agent joint policy with the original PPO. Secondly, we find that MAPPO suffers from \textit{Policy Overfitting in Multi-agent Cooperation (POMAC)}, since the agents learn their policies from sampled centralized advantage values. POMAC may lead to updating the policies of some agents in a suboptimal direction and prevent the agents from exploring better trajectories. To mitigate POMAC, we propose two novel policy perturbation methods, i.e., Noisy-Value MAPPO (NV-MAPPO) and Noisy-Advantage MAPPO (NA-MAPPO), which perturb the advantage values with random Gaussian noise. Experimental results show that our methods outperform Fine-tuned QMIX and MAPPO-FP and achieve state-of-the-art (SOTA) performance on the StarCraft Multi-Agent Challenge (SMAC). We open-source our code at \url{https://github.com/hijkzzz/noisy-mappo}.
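As a minimal illustrative sketch of the perturbation idea (the symbols $\hat{A}_t$, $\epsilon^{(i)}_t$, and $\sigma$ are assumed notation, not taken from the paper body), NA-MAPPO can be read as adding zero-mean Gaussian noise to the shared centralized advantage estimate before each agent $i$'s PPO update,
\begin{equation*}
\tilde{A}^{(i)}_t = \hat{A}_t + \epsilon^{(i)}_t, \qquad \epsilon^{(i)}_t \sim \mathcal{N}(0, \sigma^2),
\end{equation*}
while NV-MAPPO instead injects the Gaussian noise into the centralized value estimates from which the advantages are computed, so that different agents receive decorrelated advantage signals rather than the identical sampled value.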