Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite their apparent similarity to the single-agent case, multi-agent problems are often harder to train and to analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm that extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction, which allows the computation to be distributed across workers with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded: we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all of its tasks and exceeds state-of-the-art results on some of them.
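For concreteness, the following is a minimal sketch of the kind of off-policy correction the abstract refers to: the V-Trace target of Espeholt et al. (2018), which MA-Trace builds on. The factorization of the importance ratio over agents shown in the last equation is an illustrative assumption; the exact multi-agent weighting used by MA-Trace is defined in the main text.
\[
  v_s \;=\; V(x_s) \;+\; \sum_{t=s}^{s+n-1} \gamma^{\,t-s}
      \Bigl(\textstyle\prod_{i=s}^{t-1} c_i\Bigr)\, \delta_t V,
  \qquad
  \delta_t V \;=\; \rho_t \bigl(r_t + \gamma V(x_{t+1}) - V(x_t)\bigr),
\]
\[
  \rho_t \;=\; \min\!\Bigl(\bar{\rho},\, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Bigr),
  \qquad
  c_i \;=\; \min\!\Bigl(\bar{c},\, \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Bigr),
  \qquad
  \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}
  \;=\; \prod_{k} \frac{\pi^{k}(a_t^{k} \mid x_t)}{\mu^{k}(a_t^{k} \mid x_t)},
\]
where \(\mu\) is the behavior policy of a (possibly stale) worker, \(\pi\) is the current target policy, and the truncation levels \(\bar{\rho} \ge \bar{c}\) control the bias-variance trade-off of the correction. The ratios depend on the policies only through the sampled actions, which is what allows workers to generate trajectories under slightly outdated policies without degrading the training target.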