Policy gradient (PG) methods are popular reinforcement learning (RL) methods in which a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by, firstly, quantifying the contributions of the number of agents and of the agents' exploration to the variance of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB) that achieves the minimal variance. In comparison to the OB, we measure the excess variance of existing MARL algorithms such as vanilla MAPG and COMA. For implementations with deep neural networks, we also propose a surrogate version of OB, which can be seamlessly plugged into any existing PG method in MARL. On benchmarks of Multi-Agent MuJoCo and StarCraft challenges, our OB technique effectively stabilises training and improves the performance of multi-agent PPO and COMA algorithms by a significant margin.
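To make the role of a variance-reducing baseline concrete, the sketch below shows the classic variance-minimising scalar baseline for a single score-function policy-gradient estimator, b* = E[||∇ log π(a|s)||² Q(s,a)] / E[||∇ log π(a|s)||²], estimated from a batch of samples. This is only an illustrative assumption in the single-agent setting; the function names, inputs, and toy data are hypothetical, and the paper's OB derives the analogous per-agent baseline for the multi-agent case.

```python
# Minimal sketch, assuming a scalar baseline for a score-function PG estimator;
# not the paper's exact multi-agent OB derivation.
import numpy as np

def optimal_baseline(score_sq_norms: np.ndarray, returns: np.ndarray) -> float:
    """Variance-minimising scalar baseline estimated from samples.

    score_sq_norms: ||grad_theta log pi(a_t|s_t)||^2 per sample, shape [N]
    returns:        Q-value / return estimate per sample,        shape [N]
    """
    # b* = E[||score||^2 * Q] / E[||score||^2]; small epsilon avoids division by zero.
    return float(np.sum(score_sq_norms * returns) / (np.sum(score_sq_norms) + 1e-8))

def baselined_pg_weights(score_sq_norms: np.ndarray, returns: np.ndarray) -> np.ndarray:
    """Weights (Q - b*) that multiply grad log pi in the PG estimator."""
    b_star = optimal_baseline(score_sq_norms, returns)
    return returns - b_star

# Toy usage with random data standing in for one agent's sampled trajectories.
rng = np.random.default_rng(0)
norms = rng.uniform(0.1, 2.0, size=256)   # hypothetical squared score norms
q_vals = rng.normal(1.0, 0.5, size=256)   # hypothetical return estimates
print(baselined_pg_weights(norms, q_vals)[:5])
```

Subtracting this baseline leaves the gradient estimator unbiased (the score function has zero mean) while shrinking its variance, which is the property the OB extends to each agent's estimator in MAPG.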