NQMIX: Non-monotonic Value Function Factorization for Deep Multi-Agent Reinforcement Learning

Multi-agent value-based approaches have recently made great progress, especially value decomposition methods. However, value function factorization still has many limitations. In VDN, the joint action-value function is the sum of the per-agent action-value functions, while in QMIX it is a monotonic mixing of the per-agent action-value functions. QTRAN relaxes, to some extent, the restriction on the joint action-value functions that can be represented, but its performance on complex tasks is unsatisfactory. In this paper, in order to extend the class of joint value functions that can be represented, we propose a novel actor-critic method called NQMIX. NQMIX adds an off-policy policy gradient to QMIX and modifies its network architecture, which removes the monotonicity constraint of QMIX and yields a non-monotonic value function factorization of the joint action-value function. In addition, NQMIX takes the state value as the learning target, which overcomes the problem in QMIX that the learning target is overestimated. Furthermore, NQMIX can be extended to continuous action spaces by incorporating a deterministic policy gradient. Finally, we evaluate our actor-critic methods on the SMAC domain and show that they achieve stronger performance than COMA and QMIX on complex maps with heterogeneous agent types. In addition, our ablation results show that our modification of the mixer is effective.
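To make the difference between these factorizations concrete, here is a minimal sketch in plain Python. It is not the NQMIX architecture: the single linear mixing layer, the function names, and the fixed weights are all assumptions chosen for illustration. In QMIX the mixing weights are produced by state-conditioned hypernetworks and made non-negative; here they are hard-coded numbers.

```python
# Illustrative sketch only: one linear mixing layer with hypothetical,
# hard-coded weights (real mixers generate weights from the global state).

def vdn_mix(agent_qs):
    """VDN-style mixing: the joint value is the plain sum of per-agent values."""
    return sum(agent_qs)


def monotonic_mix(agent_qs, weights, bias=0.0):
    """QMIX-style mixing: abs() forces every weight to be non-negative,
    so the joint value never decreases when a per-agent value increases."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


def unconstrained_mix(agent_qs, weights, bias=0.0):
    """Mixing without the sign constraint: the joint value may decrease
    when a per-agent value increases (a non-monotonic relationship)."""
    return sum(w * q for w, q in zip(weights, agent_qs)) + bias


weights = [0.5, -2.0]      # hypothetical mixing weights
low = [1.0, 1.0]           # per-agent values before the change
high = [1.0, 3.0]          # only agent 2's value has increased

vdn_low, vdn_high = vdn_mix(low), vdn_mix(high)
mono_low, mono_high = monotonic_mix(low, weights), monotonic_mix(high, weights)
free_low, free_high = unconstrained_mix(low, weights), unconstrained_mix(high, weights)

print("VDN:          ", vdn_low, "->", vdn_high)
print("monotonic:    ", mono_low, "->", mono_high)
print("unconstrained:", free_low, "->", free_high)
```

With these assumed weights, raising agent 2's value raises the joint value under the sum and under the monotonic mixer, but lowers it under the unconstrained mixer. That last case is the kind of joint value function a monotonic mixer cannot represent.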
