Value factorization is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings, balancing learning scalability against the representational capacity of value functions. However, the theoretical understanding of such methods is limited. In this paper, we formalize a multi-agent fitted Q-iteration framework for analyzing factorized multi-agent Q-learning. Based on this framework, we investigate linear value factorization and reveal that multi-agent Q-learning with this simple decomposition implicitly realizes a powerful counterfactual credit assignment, but may not converge in some settings. Through further analysis, we find that on-policy training or richer joint value function classes can improve its local or global convergence properties, respectively. Finally, to connect our theoretical implications with practical realizations, we conduct an empirical analysis of state-of-the-art deep multi-agent Q-learning algorithms on didactic examples and a broad set of StarCraft II unit micromanagement tasks.
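To make the notion of linear value factorization concrete, the following is a minimal sketch, not the paper's implementation: a toy tabular setting with two agents in which the joint value is the sum of per-agent utilities, Q_tot(s, a_1, a_2) = q_1(s, a_1) + q_2(s, a_2), updated by a fitted-Q-style regression toward the Bellman target. All names (q1, q2, one_step_fitted_q) and the toy environment are illustrative assumptions.

```python
# Sketch of linear (sum-based) value factorization in a tabular two-agent setting.
# Assumed toy setup; not the paper's algorithm or code.
import numpy as np

n_states, n_actions, gamma = 4, 3, 0.9
rng = np.random.default_rng(0)

# Per-agent utility tables for two agents.
q1 = np.zeros((n_states, n_actions))
q2 = np.zeros((n_states, n_actions))

def q_tot(s):
    """Joint value over all (a1, a2) pairs implied by the linear factorization."""
    return q1[s][:, None] + q2[s][None, :]

def one_step_fitted_q(batch, lr=0.1):
    """One regression-style update toward the Bellman target,
    applied through the factorized (summed) joint value."""
    for s, a1, a2, r, s_next in batch:
        target = r + gamma * q_tot(s_next).max()   # greedy joint bootstrap
        td_err = target - (q1[s, a1] + q2[s, a2])  # error of the summed value
        # The squared-error gradient w.r.t. each utility is the same TD error,
        # so the update implicitly splits credit between the two agents.
        q1[s, a1] += lr * td_err
        q2[s, a2] += lr * td_err

# Toy batch of transitions (s, a1, a2, r, s') drawn uniformly at random.
batch = [(rng.integers(n_states), rng.integers(n_actions), rng.integers(n_actions),
          rng.normal(), rng.integers(n_states)) for _ in range(256)]
for _ in range(50):
    one_step_fitted_q(batch)
print(q_tot(0))
```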