Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning

Value function decomposition is becoming a popular rule of thumb for scaling up multi-agent reinforcement learning (MARL) in cooperative games. For such a decomposition to be valid, the individual-global max (IGM) principle must be assumed; that is, the actions that maximize each agent's decomposed value function must together maximize the joint value function. This principle, however, does not hold in general. As a result, the range of games to which value decomposition algorithms apply is unclear, and their convergence properties remain unknown. In this paper, we make the first effort to answer these questions. Specifically, we introduce the set of cooperative games in which value decomposition methods are valid, which we refer to as decomposable games. In decomposable games, we theoretically prove that applying the multi-agent fitted Q-iteration algorithm (MA-FQI) leads to an optimal Q-function. In non-decomposable games, the Q-function estimated by MA-FQI can still converge to the optimum, provided that the Q-function is projected onto the decomposable function space at each iteration. In both settings, we consider value function representations by practical deep neural networks and derive the corresponding convergence rates. To summarize, our results offer, for the first time, theoretical insights for MARL practitioners into when value decomposition algorithms converge and why they perform well.
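To make the IGM condition concrete, the following is a minimal sketch for a one-shot, two-agent matrix game. It is an illustration under stated assumptions, not the paper's construction: the additive form of the decomposition, the payoff values, and the helper names `satisfies_igm` and `additive_least_squares_fit` are all hypothetical choices made for this example. The check simply asks whether each agent's greedy action, taken together, attains the maximum of the joint value table.

```python
import numpy as np


def satisfies_igm(q_joint, local_qs):
    """Return True if the per-agent greedy actions jointly maximize q_joint."""
    greedy = tuple(int(np.argmax(q)) for q in local_qs)
    return bool(np.isclose(q_joint[greedy], q_joint.max()))


def additive_least_squares_fit(q_joint):
    """Best additive fit q1[i] + q2[j] to a 2-D table under uniform weighting.

    Assumption: an additive decomposition is used purely for illustration.
    """
    grand_mean = q_joint.mean()
    q1 = q_joint.mean(axis=1) - grand_mean / 2.0
    q2 = q_joint.mean(axis=0) - grand_mean / 2.0
    return q1, q2


# Toy game that is additive by construction (values are made up).
q1 = np.array([1.0, 3.0, 2.0])
q2 = np.array([0.5, 0.0, 4.0])
additive_game = q1[:, None] + q2[None, :]

# Toy game with a strong coupling between the agents' actions (values are made up).
coupled_game = np.array([
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
])

igm_on_additive = satisfies_igm(additive_game, [q1, q2])
fit1, fit2 = additive_least_squares_fit(coupled_game)
igm_on_coupled = satisfies_igm(coupled_game, [fit1, fit2])

print("IGM holds on additive game:", igm_on_additive)
print("IGM holds on coupled game with additive fit:", igm_on_coupled)
```

In the additive game the local greedy actions recover the joint maximizer, whereas in the coupled game the additive fit steers both agents away from the jointly optimal action pair. This is one simple way to see why the IGM principle cannot be taken for granted.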
