Recent works have applied Proximal Policy Optimization (PPO) to multi-agent cooperative tasks, such as Independent PPO (IPPO) and vanilla Multi-Agent PPO (MAPPO), which uses a centralized value function. However, previous literature shows that MAPPO may not perform as well as IPPO and Fine-tuned QMIX on the StarCraft Multi-Agent Challenge (SMAC). MAPPO-Feature-Pruned (MAPPO-FP) improves the performance of MAPPO with carefully designed agent-specific features, which may be unfriendly to algorithmic utility. By contrast, we find that MAPPO may suffer from \textit{Policy Overfitting in Multi-agent Cooperation (POMAC)}, since the agents learn their policies from sampled advantage values. POMAC may then update the multi-agent policies in a suboptimal direction and prevent the agents from exploring better trajectories. In this paper, to mitigate this overfitting, we propose a novel policy regularization method that disturbs the advantage values with random Gaussian noise. Experimental results show that our method outperforms Fine-tuned QMIX and MAPPO-FP, achieving state-of-the-art performance on SMAC without agent-specific features. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
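The core of the proposed regularization is to perturb the sampled advantage values with zero-mean Gaussian noise before they enter the policy update. A minimal NumPy sketch of this idea is below; the function name and the noise scale `sigma` are hypothetical illustration choices, not values taken from the paper or its released code.

```python
import numpy as np

def perturb_advantages(advantages, sigma=0.5, rng=None):
    """Disturb sampled advantage values with zero-mean Gaussian noise.

    Instead of feeding the sampled advantages directly into the PPO
    policy loss, each value is perturbed by random Gaussian noise,
    acting as a regularizer against overfitting to any one batch of
    sampled trajectories. `sigma` is a hypothetical noise-scale
    hyperparameter chosen for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=sigma, size=advantages.shape)
    return advantages + noise

# Example: a batch of sampled advantages for one agent.
advantages = np.array([0.2, -1.1, 0.7, 0.0])
noisy = perturb_advantages(advantages, sigma=0.5,
                           rng=np.random.default_rng(0))
```

The noisy advantages would then replace the originals in the clipped surrogate objective; since the noise has zero mean, the perturbed policy gradient remains an unbiased-in-expectation estimate while individual updates are randomized.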