Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, the Hanabi challenge, and Google Research Football, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to strong off-policy methods, PPO often achieves competitive or superior results in both final rewards and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods are a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.
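For readers less familiar with PPO itself, the sketch below shows the clipped surrogate objective that the abstract refers to. It is a minimal, self-contained illustration only, not the released implementation at the linked repository; the function name, argument names, and toy inputs are hypothetical.

```python
import numpy as np


def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate loss (to be minimized).

    log_probs_new / log_probs_old: log-probabilities of the taken actions
    under the current and behaviour policies; advantages: advantage estimates.
    """
    ratio = np.exp(log_probs_new - log_probs_old)            # importance ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # clipped ratio
    # Pessimistic minimum of the unclipped and clipped objectives,
    # negated so that gradient descent maximizes the surrogate.
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()


if __name__ == "__main__":
    # Toy usage with synthetic numbers.
    rng = np.random.default_rng(0)
    lp_old = rng.normal(-1.0, 0.1, size=64)
    lp_new = lp_old + rng.normal(0.0, 0.05, size=64)
    adv = rng.normal(0.0, 1.0, size=64)
    print(ppo_clipped_loss(lp_new, lp_old, adv))
```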