Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm, but it is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often attributed to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft Multi-Agent Challenge, Google Research Football, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to strong off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze the implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that, when using these practices, simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at \url{https://github.com/marlbenchmark/on-policy}.
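For reference, the PPO-based methods discussed above build on the standard clipped surrogate objective; the equation below is general background rather than a restatement of this paper's multi-agent formulation. Writing the probability ratio as $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and letting $\hat{A}_t$ denote an advantage estimate,
\begin{equation*}
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right],
\end{equation*}
where $\epsilon$ is the clipping hyperparameter. In cooperative multi-agent variants, $\hat{A}_t$ is commonly computed from a centralized value function with access to global information; this last detail is a common design choice in the literature and is not stated in the abstract itself.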