Value factorization is a popular and promising approach to scaling up multi-agent reinforcement learning in cooperative settings. However, the theoretical understanding of such methods is limited. In this paper, we formalize a multi-agent fitted Q-iteration framework for analyzing factorized multi-agent Q-learning. Based on this framework, we investigate linear value factorization and reveal that multi-agent Q-learning with this simple decomposition implicitly realizes a powerful counterfactual credit assignment, but may fail to converge in some settings. Through further analysis, we find that on-policy training and richer joint value function classes can improve its local and global convergence properties, respectively. Finally, to support and extend our theoretical implications toward practical realization, we conduct an empirical analysis of state-of-the-art deep multi-agent Q-learning algorithms on didactic examples and a broad set of StarCraft II unit micromanagement tasks.
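As an illustrative sketch of the linear value factorization referred to above (the notation here is ours and is not introduced in the abstract): the joint action-value function is represented as a sum of per-agent utilities,
\[
Q_{\mathrm{tot}}(\boldsymbol{\tau}, \boldsymbol{a}) \;=\; \sum_{i=1}^{n} Q_i(\tau_i, a_i),
\]
where $\tau_i$ and $a_i$ denote agent $i$'s local action-observation history and action. Richer joint value function classes relax this additive form, for example by mixing the individual utilities $Q_i$ through a more expressive (e.g., monotonic) function.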