In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint action-value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
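As a concrete illustration of the monotonic mixing idea described above, the following is a minimal PyTorch sketch of a state-conditioned mixing network whose weights are constrained to be non-negative, so that the mixed value is monotonic in every per-agent value and per-agent greedy actions maximise the joint value. This is an assumption-laden sketch, not the paper's reference implementation; names such as MonotonicMixer, state_dim, n_agents, and embed_dim are illustrative.

```python
# Minimal sketch of a monotonic mixing network (QMIX-style), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks map the global state to the mixing weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) per-agent Q-values; state: (batch, state_dim).
        bs = agent_qs.size(0)
        # Absolute values keep the mixing weights non-negative, which enforces
        # dQ_tot / dQ_a >= 0 for every agent a (the monotonicity constraint).
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(bs, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        return q_tot.view(bs, 1)
```

Because the mixing weights are non-negative, the argmax of the mixed value over joint actions decomposes into each agent's individual argmax, which is what allows decentralised execution to remain consistent with the centrally learned joint action-value.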