Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs through its successful application to multi-agent cooperative tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent works have applied Proximal Policy Optimization (PPO) to multi-agent tasks, yielding Multi-agent PPO (MAPPO). However, MAPPO in current works lacks theoretical support and requires handcrafted agent-specific features, a variant referred to as MAPPO-agent-specific (MAPPO-AS). In addition, the performance of MAPPO-AS is still lower than that of finetuned QMIX on the popular benchmark environment StarCraft Multi-agent Challenge (SMAC). In this paper, we first theoretically generalize single-agent PPO to vanilla MAPPO, showing that vanilla MAPPO is approximately equivalent to optimizing a multi-agent joint policy with the original PPO. Second, the centralized advantage function in vanilla MAPPO lacks a credit assignment mechanism, which may update the policies of some agents in a suboptimal direction and thus prevent the agents from exploring better trajectories; we call this problem \textit{Policy Overfitting in Multi-agent Cooperation (POMAC)}. To address POMAC, we propose Noisy Advantage Values (Noisy-MAPPO and Advantage-Noisy-MAPPO), which smooth the advantage values in a manner similar to label smoothing. The experimental results show that the average performance of Noisy-MAPPO is better than that of finetuned QMIX and MAPPO-AS, and is much better than that of vanilla MAPPO. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
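For illustration only, the following is a minimal sketch (not the authors' released implementation) of one way per-agent Gaussian noise could be added to a shared centralized advantage before the PPO clipped surrogate loss, in the spirit of the Noisy Advantage Values described above; the function names, the noise scale \texttt{sigma}, and the tensor shapes are assumptions made for exposition.

\begin{verbatim}
# Minimal sketch of noisy advantage values (illustrative assumptions, not the
# authors' code): each agent receives the shared centralized advantage perturbed
# by independent zero-mean Gaussian noise, a label-smoothing-like regularizer
# on the credit signal, before the standard PPO clipped surrogate loss.
import torch


def noisy_advantages(advantages, n_agents, sigma=1.0):
    # advantages: (batch,) centralized advantage per joint transition.
    # Returns (batch, n_agents): one perturbed copy per agent, so different
    # agents get slightly different update directions.
    noise = sigma * torch.randn(advantages.shape[0], n_agents)
    return advantages.unsqueeze(-1) + noise


def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Standard PPO clipped surrogate, applied per agent with noisy advantages.
    ratio = torch.exp(log_probs - old_log_probs)      # (batch, n_agents)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Usage sketch: a batch of 64 joint transitions with 3 agents.
batch, n_agents = 64, 3
adv = torch.randn(batch)                                  # centralized advantages
logp = torch.randn(batch, n_agents, requires_grad=True)   # current per-agent log-probs
old_logp = logp.detach() + 0.01 * torch.randn(batch, n_agents)
loss = ppo_clip_loss(logp, old_logp, noisy_advantages(adv, n_agents))
loss.backward()
\end{verbatim}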