Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective of the original Markov Decision Process (MDP). Divergence regularization has been proposed to address this problem, but it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Theoretically, we derive the update rule of DMAC, which is naturally off-policy and guarantees monotonic policy improvement and convergence in both the original MDP and the divergence-regularized MDP. We also give a bound on the discrepancy between the converged policy and the optimal policy in the original MDP. DMAC is a flexible framework and can be combined with many existing MARL algorithms. Empirically, we evaluate DMAC in a didactic stochastic game and the StarCraft Multi-Agent Challenge and show that DMAC substantially improves the performance of existing MARL algorithms.
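For concreteness, divergence regularization of this kind is commonly formulated by penalizing the return with a KL term toward a reference policy; a standard form of such an objective (illustrative notation, not necessarily the paper's exact formulation, with regularization coefficient $\alpha$ and reference policy $\mu$) is
$$ J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t,a_t) - \alpha\, D_{\mathrm{KL}}\big(\pi(\cdot\mid s_t)\,\|\,\mu(\cdot\mid s_t)\big)\Big)\right], $$
which recovers the entropy-regularized objective (up to a constant) when $\mu$ is the uniform policy.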