The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for centralised training with decentralised execution. However, after years of sustained improvement on SMAC, algorithms now achieve near-perfect performance. In this work, we conduct a new analysis demonstrating that SMAC lacks the stochasticity needed to require complex closed-loop policies. In particular, we show that an open-loop policy conditioned only on the timestep can achieve non-trivial win rates for many SMAC scenarios. To address this limitation, we introduce SMACv2, a new version of the benchmark in which scenarios are procedurally generated and agents must generalise to previously unseen settings (drawn from the same distribution) during evaluation. We show that these changes ensure the benchmark requires the use of closed-loop policies. We evaluate state-of-the-art algorithms on SMACv2 and show that it poses significant challenges absent from the original benchmark. Our analysis illustrates that SMACv2 addresses the discovered deficiencies of SMAC and can help benchmark the next generation of MARL methods. Videos of training are available at https://sites.google.com/view/smacv2