Value function factorization via centralized training and decentralized execution is a promising approach to cooperative multi-agent reinforcement learning tasks. QMIX, a representative method in this area, has become state-of-the-art and achieved the best performance on the StarCraft II micromanagement benchmark. However, the monotonic mixing of per-agent estimates in QMIX is known to restrict the joint action Q-values it can represent, and individual agents often lack sufficient global state information for value function estimation, resulting in suboptimal policies. To this end, we present LSF-SAC, a novel framework that features a variational inference-based information-sharing mechanism, providing extra state information to assist individual agents in value function factorization. We demonstrate that such latent individual state information sharing can significantly expand the representational power of value function factorization, while fully decentralized execution is still maintained in LSF-SAC through a soft actor-critic design. We evaluate LSF-SAC on the StarCraft II micromanagement challenge and demonstrate that it outperforms several state-of-the-art methods on challenging collaborative tasks. We further conduct extensive ablation studies to locate the key factors behind its performance improvements. We believe this new insight can lead to new local value estimation methods and variational deep learning algorithms. A demo video and the implementation code can be found at https://sites.google.com/view/sacmm.