Value-decomposition methods, which reduce the difficulty of a multi-agent system by decomposing the joint state-action space into local observation-action spaces, have become popular in cooperative multi-agent reinforcement learning (MARL). However, value-decomposition methods still suffer from tremendous sample consumption during training and a lack of active exploration. In this paper, we propose a scalable value-decomposition exploration (SVDE) method, which includes a scalable training mechanism, an intrinsic reward design, and explorative experience replay. The scalable training mechanism asynchronously decouples strategy learning from environmental interaction, accelerating sample generation in a MapReduce manner. To address the lack of exploration, the intrinsic reward design and explorative experience replay enhance exploration to produce diverse samples and filter out non-novel samples, respectively. Empirically, our method achieves the best performance on almost all maps compared with other popular algorithms on a set of StarCraft II micromanagement games. A data-efficiency experiment further shows that SVDE accelerates sample collection and policy convergence, and a set of ablation experiments demonstrates the effectiveness of each component of SVDE.