Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective and makes the converged policy deviate from the optimal policy of the original Markov decision process. Although divergence regularization has been proposed to address this problem, it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Mathematically, we derive the update rule of DMAC, which is naturally off-policy, guarantees monotonic policy improvement, and is not biased by the regularization. DMAC is a flexible framework and can be combined with many existing MARL algorithms. We evaluate DMAC in a didactic stochastic game and the StarCraft Multi-Agent Challenge and empirically show that DMAC substantially improves the performance of existing MARL algorithms.
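For context, a minimal sketch of the contrast between the two forms of regularization, in generic notation that is our own and not necessarily the paper's (policy $\pi$, reward $r_t$, entropy $\mathcal{H}$, reference policy $\bar{\pi}$, temperature $\alpha$):
\[
J_{\text{ent}}(\pi) = \mathbb{E}_{\pi}\!\Big[\sum_{t} \gamma^{t}\big(r_t + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big],
\qquad
J_{\text{div}}(\pi) = \mathbb{E}_{\pi}\!\Big[\sum_{t} \gamma^{t}\big(r_t - \alpha\,D_{\mathrm{KL}}(\pi(\cdot\mid s_t)\,\|\,\bar{\pi}(\cdot\mid s_t))\big)\Big].
\]
The entropy bonus in $J_{\text{ent}}$ always rewards stochasticity, so its optimum generally differs from that of the original MDP. In the divergence-regularized objective $J_{\text{div}}$, if $\bar{\pi}$ is taken to be the previous policy and updated each iteration, the KL penalty vanishes at a fixed point, which is why this style of regularization can avoid biasing the converged policy; the precise formulation used by DMAC may differ from this sketch.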