Policy Regularization with Noisy Advantage Values for Cooperative Multi-agent Actor-Critic Methods

Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs with its successful application to multi-agent cooperative tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent work has applied Proximal Policy Optimization (PPO) to multi-agent tasks, an approach called Multi-agent PPO (MAPPO). However, previous literature shows that vanilla MAPPO with a shared value function may not perform as well as Independent PPO (IPPO) or finetuned QMIX. MAPPO-agent-specific (MAPPO-AS) therefore improves on the performance of vanilla MAPPO and IPPO by adding artificial agent-specific features. In addition, no existing literature gives a theoretical analysis of the working mechanism of MAPPO. In this paper, we first theoretically generalize single-agent PPO to vanilla MAPPO, which shows that vanilla MAPPO is approximately equivalent to optimizing a multi-agent joint policy with the original PPO. Second, we find that vanilla MAPPO suffers from the problem of \textit{The Policies Overfitting in Multi-agent Cooperation (POMAC)}, because the agents learn their policies from sampled centralized advantage values. POMAC may cause the policies of some agents to be updated in a suboptimal direction and prevent the agents from exploring better trajectories. To solve the POMAC problem, we propose two novel policy regularization methods, Noisy-MAPPO and Advantage-Noisy-MAPPO, which smooth out the advantage values with noise. The experimental results show that the average performance of Noisy-MAPPO is better than that of finetuned QMIX and MAPPO-AS, and much better than that of vanilla MAPPO. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
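The abstract says only that the advantage values are smoothed with noise; it does not say how the noise is applied. As a rough illustration of the idea, and not the authors' implementation, the sketch below assumes that each agent receives the shared centralized advantage mixed with its own independent Gaussian noise. The function name `noisy_advantages` and the mixing weight `alpha` are hypothetical and chosen only for this example.

```python
import numpy as np


def noisy_advantages(advantages, n_agents, alpha=0.05, rng=None):
    """Mix a shared advantage with per-agent Gaussian noise (illustrative only).

    advantages: sequence of shape (T,), one centralized advantage per timestep.
    n_agents:   number of cooperating agents.
    alpha:      hypothetical mixing weight in [0, 1]; 0 leaves advantages unchanged.
    Returns an array of shape (T, n_agents).
    """
    rng = np.random.default_rng() if rng is None else rng
    adv = np.asarray(advantages, dtype=np.float64)
    # Assumption: independent standard-normal noise per timestep and per agent.
    noise = rng.standard_normal((adv.shape[0], n_agents))
    # Broadcast the shared advantage across agents, then blend in the noise.
    return (1.0 - alpha) * adv[:, None] + alpha * noise
```

With `alpha=0.0` every agent sees exactly the same centralized advantage, which corresponds to the shared-advantage setting the abstract associates with POMAC. With `alpha > 0` the per-agent values differ slightly, so in this sketch the agents are not all pushed in the same direction by a single sampled value.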
