Policy gradient methods are an attractive approach to multi-agent reinforcement learning problems due to their convergence properties and robustness in partially observable scenarios. However, there is a significant performance gap between state-of-the-art policy gradient and value-based methods on the popular StarCraft Multi-Agent Challenge (SMAC) benchmark. In this paper, we introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods. We enhance two state-of-the-art policy gradient algorithms with SOP training, demonstrating significant performance improvements. Furthermore, we show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.