Maximum Correntropy Value Decomposition for Multi-agent Deep Reinforcement Learning

We explore value decomposition solutions for multi-agent deep reinforcement learning in the popular paradigm of centralized training with decentralized execution (CTDE). Among CTDE methods, Weighted QMIX is state of the art on the StarCraft Multi-Agent Challenge (SMAC); it adds a weighting scheme to QMIX that places more emphasis on the optimal joint actions. However, the fixed weight must be tuned manually for each application scenario, which hinders the use of Weighted QMIX in broader engineering applications. In this paper, we first demonstrate the flaw of Weighted QMIX using an ordinary One-Step Matrix Game (OMG): no matter how the weight is chosen, Weighted QMIX struggles with non-monotonic value decomposition problems whose reward distributions have a large variance. We then characterize value decomposition as an Underfitting One-edged Robust Regression problem and make the first attempt to solve it from the perspective of information-theoretical learning. We introduce the Maximum Correntropy Criterion (MCC) as a cost function that dynamically adapts the weight to eliminate the effects of minima in the reward distributions. We simplify the implementation and propose a new algorithm called MCVD. A preliminary experiment on OMG shows that MCVD can handle non-monotonic value decomposition problems with a large tolerance in kernel bandwidth selection. Further experiments on Cooperative-Navigation and multiple SMAC scenarios show that MCVD exhibits unprecedented ease of implementation, broad applicability, and stability.
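To make the idea of a dynamically adapted weight concrete, the following is a minimal sketch of a generic correntropy-induced loss with a Gaussian kernel. It is not the MCVD implementation, which the abstract describes only as a simplified variant, and it does not model the one-edged aspect of the regression problem. The function names, the bandwidth `sigma`, and the sample errors are assumptions chosen for illustration.

```python
import math


def correntropy_loss(errors, sigma=1.0):
    """Mean correntropy-induced loss: 1 - exp(-e^2 / (2 * sigma^2)).

    Maximizing correntropy with a Gaussian kernel is equivalent to
    minimizing this quantity. `sigma` is the kernel bandwidth.
    """
    total = sum(1.0 - math.exp(-e * e / (2.0 * sigma ** 2)) for e in errors)
    return total / len(errors)


def implicit_weights(errors, sigma=1.0):
    """Per-sample factor exp(-e^2 / (2 * sigma^2)).

    The derivative of the loss above with respect to an error e is
    weight * e / sigma^2, so each sample's squared-error gradient is
    scaled by this data-dependent weight instead of a fixed constant.
    """
    return [math.exp(-e * e / (2.0 * sigma ** 2)) for e in errors]


# Hypothetical TD errors: an exact fit, a small error, and a large outlier.
errors = [0.0, 0.5, 8.0]
weights = implicit_weights(errors, sigma=1.0)
loss = correntropy_loss(errors, sigma=1.0)

for e, w in zip(errors, weights):
    print(f"error={e:4.1f}  weight={w:.3e}")
print(f"mean loss={loss:.4f}")
```

In this sketch the weight shrinks smoothly toward zero as the error grows, so samples far from the current fit contribute almost nothing to the gradient. The bandwidth `sigma` sets how quickly that happens, which is the quantity the abstract refers to as kernel bandwidth selection.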
