We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). A key concept in CTDE is the Individual-Global-Max (IGM) principle, which requires consistency between joint and local action selections to support efficient local decision-making. However, in order to achieve scalability, existing MARL methods either limit the representational expressiveness of their value function classes or relax the IGM consistency, which may incur instability or poor performance in complex domains. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), which takes a duplex dueling network architecture to factorize the joint value function. This duplex dueling structure encodes the IGM principle into the neural network architecture and thus enables efficient value function learning. Theoretical analysis shows that QPLEX achieves a complete IGM function class. Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.
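As a brief aside, the IGM consistency mentioned above is commonly written as follows (a standard statement from the value factorization literature; the notation $\boldsymbol{\tau}$, $\boldsymbol{a}$, $Q_{tot}$, and $Q_i$ is assumed here rather than taken from this abstract):
\[
\operatorname*{arg\,max}_{\boldsymbol{a}} Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})
= \Big( \operatorname*{arg\,max}_{a_1} Q_1(\tau_1, a_1),\ \dots,\ \operatorname*{arg\,max}_{a_n} Q_n(\tau_n, a_n) \Big),
\]
i.e., greedy action selection on the joint value function coincides with independent greedy selection on each agent's local value function, which is what allows decentralized execution to remain consistent with the centrally trained joint value.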