In cooperative multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) has achieved remarkable success. The Individual-Global-Max (IGM) condition, a key element of CTDE, specifies the consistency between local and joint policies. Most IGM-based research focuses on how to establish this consistency, while little attention has been paid to examining IGM's potential flaws. In this work, we reveal that the IGM condition is a lossy decomposition, and that the error introduced by this lossy decomposition accumulates in hypernetwork-based methods. To address this issue, we propose an imitation learning strategy that separates the lossy decomposition from the Bellman iterations, thereby avoiding error accumulation. The proposed strategy is proved theoretically and verified empirically on the StarCraft Multi-Agent Challenge benchmark with zero sight range. The results also confirm that the proposed method outperforms state-of-the-art IGM-based approaches.
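For reference, the IGM condition mentioned above is the standard consistency requirement from the value-decomposition literature; the notation below (joint action-value $Q_{jt}$, per-agent utilities $Q_i$, joint action-observation history $\boldsymbol{\tau}$, and joint action $\mathbf{u}$) is assumed here for illustration and is not taken from this abstract:
$$
\arg\max_{\mathbf{u}} Q_{jt}(\boldsymbol{\tau}, \mathbf{u})
=
\Big( \arg\max_{u^1} Q_1(\tau^1, u^1),\; \ldots,\; \arg\max_{u^n} Q_n(\tau^n, u^n) \Big).
$$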