Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs through its successful application to multi-agent cooperative tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent works have applied Proximal Policy Optimization (PPO) to multi-agent tasks, including Independent PPO (IPPO) and vanilla Multi-Agent PPO (MAPPO), which uses a centralized value function. However, previous literature shows that MAPPO may not perform as well as IPPO or Fine-tuned QMIX. MAPPO-Feature-Pruned (MAPPO-FP) therefore improves the performance of MAPPO via carefully designed artificial features. In addition, no prior literature gives a theoretical analysis of the working mechanism of MAPPO. In this paper, we first theoretically generalize single-agent PPO to MAPPO, showing that MAPPO is approximately equivalent to optimizing a multi-agent joint policy with the original PPO. Second, we find that MAPPO faces the problem of \textit{Policy Overfitting in Multi-agent Cooperation (POMAC)}, since the agents learn their policies from sampled centralized advantage values. POMAC may then drive the policy updates of some agents in a suboptimal direction and prevent the agents from exploring better trajectories. To mitigate POMAC, we propose two novel policy-perturbation methods, i.e., Noisy-Value MAPPO (NV-MAPPO) and Noisy-Advantage MAPPO (NA-MAPPO), which perturb the advantage values via random Gaussian noise. The experimental results show that our methods outperform Fine-tuned QMIX and MAPPO-FP, achieving SOTA on the StarCraft Multi-Agent Challenge (SMAC). We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
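The core operation described above, perturbing sampled centralized advantage values with Gaussian noise before the policy update, can be sketched as follows. This is a minimal illustration only: the function name, the noise scale \texttt{sigma}, and the per-agent noise layout are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def perturb_advantages(advantages, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise to centralized advantage estimates.

    A minimal sketch of the Noisy-Advantage idea: `advantages` is an
    array of sampled advantage values (e.g. shape [timesteps, n_agents]),
    and `sigma` is a hypothetical noise-scale hyperparameter. The noisy
    advantages would then be fed into the standard PPO policy loss.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(advantages))
    return advantages + noise
```

In the Noisy-Value variant, the noise would instead be injected into the value-function outputs, so the perturbation reaches the advantages indirectly through the advantage estimator.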