Value-based methods for multi-agent reinforcement learning (MARL), especially value decomposition methods, have proven effective on a range of challenging cooperative tasks. However, current methods pay little attention to the interaction between agents, which is essential to teamwork in games and real life. This limits the efficiency of value-based MARL algorithms in two aspects: collaborative exploration and value function estimation. In this paper, we propose a novel cooperative MARL algorithm named interactive actor-critic~(IAC), which models the interaction of agents from the perspectives of both the policy and the value function. On the policy side, a multi-agent joint stochastic policy is introduced by adopting a collaborative exploration module, which is trained by maximizing the entropy-regularized expected return. On the value side, we use a shared attention mechanism to estimate the value function of each agent, which takes the impact of teammates into consideration. At the implementation level, we extend value decomposition methods to continuous control tasks and evaluate IAC on benchmark tasks including classic control and multi-agent particle environments. Experimental results indicate that our method outperforms state-of-the-art approaches and achieves better performance in terms of cooperation.
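For reference, the entropy-regularized expected return mentioned above is conventionally written in the maximum-entropy form below. This is a sketch of the standard objective rather than the exact multi-agent formulation used by IAC; the symbols ($\pi$ for the joint stochastic policy, $\mathcal{H}$ for policy entropy, $\alpha$ for the temperature coefficient) follow common usage and are not taken from the paper.
\[
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, \mathbf{a}_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right],
\]
where $\mathbf{a}_t$ denotes the joint action of all agents at time $t$ and the entropy bonus encourages the collaborative exploration described above.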