Deep reinforcement learning methods have shown great performance on many challenging cooperative multi-agent tasks. Two promising research directions are multi-agent value function decomposition and multi-agent policy gradients. In this paper, we propose a new decomposed multi-agent soft actor-critic (mSAC) method that effectively combines the advantages of these two approaches. Its main modules are a decomposed Q-network architecture, a discrete probabilistic policy, and an optional counterfactual advantage function. Theoretically, mSAC supports efficient off-policy learning and partially addresses the credit assignment problem in both discrete and continuous action spaces. On the StarCraft II micromanagement cooperative multi-agent benchmark, we empirically investigate the performance of mSAC against its variants and analyze the effects of the different components. Experimental results demonstrate that mSAC significantly outperforms the policy-based approach COMA and achieves results competitive with the SOTA value-based approach Qmix on most tasks in terms of asymptotic performance. In addition, mSAC achieves strong results on tasks with large action spaces, such as 2c_vs_64zg and MMM2.
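To make the two named ingredients concrete, the sketch below (PyTorch) shows a QMIX-style monotonic mixing network that decomposes the joint value into per-agent utilities, together with a discrete soft (entropy-regularized) actor loss in the style of discrete SAC. It is a minimal illustration under those assumptions, not the authors' implementation; names such as MonotonicMixer, discrete_sac_actor_loss, and alpha are illustrative only.

```python
# Minimal sketch of (1) a monotonic mixing network for value decomposition
# and (2) a discrete soft actor loss. Hyperparameters and names are
# illustrative assumptions, not taken from the mSAC paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MonotonicMixer(nn.Module):
    """Mixes per-agent Q values into a joint Q_tot with non-negative mixing
    weights (via abs), so that dQ_tot/dQ_i >= 0 as in QMIX-style decomposition."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks produce mixing weights conditioned on the global state.
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.b2(state).view(bs, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)  # joint Q_tot


def discrete_sac_actor_loss(logits: torch.Tensor,
                            q_values: torch.Tensor,
                            alpha: float = 0.05) -> torch.Tensor:
    """Discrete-action soft actor loss for one agent:
    E_{a~pi}[alpha * log pi(a|o) - Q(o, a)], computed exactly by summing
    over the finite action set instead of sampling."""
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return (probs * (alpha * log_probs - q_values)).sum(dim=-1).mean()


if __name__ == "__main__":
    batch, n_agents, state_dim, n_actions = 8, 3, 16, 5
    mixer = MonotonicMixer(n_agents, state_dim)
    agent_qs = torch.randn(batch, n_agents)   # each agent's chosen-action Q
    state = torch.randn(batch, state_dim)     # global state for the mixer
    q_tot = mixer(agent_qs, state)            # (batch, 1) joint value
    logits = torch.randn(batch, n_actions)    # one agent's policy logits
    q_vals = torch.randn(batch, n_actions)    # that agent's per-action Q estimates
    print(q_tot.shape, discrete_sac_actor_loss(logits, q_vals).item())
```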