MALib: A Parallel Framework for Population-based Multi-agent Reinforcement Learning

Population-based multi-agent reinforcement learning (PB-MARL) refers to a family of methods nested with reinforcement learning (RL) algorithms, which produce a self-generated sequence of tasks arising from the coupled population dynamics. By leveraging auto-curricula to induce a population of distinct emergent strategies, PB-MARL has achieved impressive success in tackling multi-agent tasks. Despite remarkable prior work on distributed RL frameworks, PB-MARL poses new challenges for parallelizing training frameworks, because sampling, training, and evaluation form multiple nested workloads that involve heterogeneous policy interactions. To address these challenges, we present MALib, a scalable and efficient computing framework for PB-MARL. Our framework comprises three key components: (1) a centralized task dispatching model, which supports self-generated tasks and scalable training with heterogeneous policy combinations; (2) a programming architecture named Actor-Evaluator-Learner, which achieves high parallelism for both training and sampling, and meets the evaluation requirement of auto-curriculum learning; (3) a higher-level abstraction of MARL training paradigms, which enables efficient code reuse and flexible deployment on different distributed computing paradigms. Experiments on a series of complex tasks such as multi-agent Atari games show that MALib achieves throughput higher than 40K FPS on a single machine with 32 CPU cores, and delivers a 5x speedup over RLlib and at least a 3x speedup over OpenSpiel in multi-agent training tasks. MALib is publicly available at https://github.com/sjtu-marl/malib.
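To make the interplay between centralized task dispatching and the Actor-Evaluator-Learner roles concrete, the following is a minimal, single-process sketch in plain Python. It is not MALib's API: every name here (`Task`, `actor`, `learner`, `evaluator`, `run_dispatcher`) is hypothetical, the "rollouts" are random numbers, and the loop runs sequentially rather than in parallel. It only illustrates how an evaluation step can generate the next training task against a growing population.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str            # "rollout", "train", or "evaluate"
    learner_id: str      # policy currently being trained
    opponent_ids: tuple  # fixed policies it is matched against


def actor(task, rng):
    """Toy rollout worker: returns fake (opponent, reward) samples."""
    return [(opp, rng.random()) for opp in task.opponent_ids for _ in range(4)]


def learner(samples):
    """Toy learner: the 'update' is just the mean sampled reward."""
    return sum(r for _, r in samples) / len(samples)


def evaluator(score, threshold=0.0):
    """Toy evaluator: decides whether the trained policy joins the population."""
    return score >= threshold


def run_dispatcher(max_population=3, seed=0):
    """Hypothetical central dispatcher: owns the task queue and the population."""
    rng = random.Random(seed)
    population = ["policy_0"]
    queue = deque([Task("rollout", "policy_1", tuple(population))])
    buffer, score, history = [], 0.0, []

    while queue:
        task = queue.popleft()
        history.append(task.kind)
        if task.kind == "rollout":
            buffer = actor(task, rng)
            queue.append(Task("train", task.learner_id, task.opponent_ids))
        elif task.kind == "train":
            score = learner(buffer)
            queue.append(Task("evaluate", task.learner_id, task.opponent_ids))
        elif task.kind == "evaluate":
            if evaluator(score):
                population.append(task.learner_id)
            if len(population) < max_population:
                # Self-generated task: the next policy trains against the
                # current (possibly enlarged) population.
                next_id = f"policy_{len(population)}"
                queue.append(Task("rollout", next_id, tuple(population)))
    return population, history


if __name__ == "__main__":
    final_population, task_history = run_dispatcher()
    print(final_population)
    print(task_history)
```

In this sketch each new policy is matched against every policy already in the population, which is one simple way to model heterogeneous policy combinations. A real framework would run the three roles as concurrent workers and replace the toy functions with actual environment sampling, gradient updates, and payoff estimation.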
