Value decomposition methods for multi-agent reinforcement learning learn the global value function as a mixture of individual agents' utility functions. Coordination graphs (CGs) represent a higher-order decomposition by incorporating pairwise payoff functions and are thus expected to have greater representational capacity. However, CGs decompose the global value function linearly over local value functions, severely limiting the complexity of the value function class that can be represented. In this paper, we propose the first non-linear coordination graph by extending CG value decomposition beyond the linear case. One major challenge is performing greedy action selection in this new function class, to which commonly adopted DCOP algorithms are no longer applicable. We study how to solve this problem when mixing networks with LeakyReLU activation are used. We first propose an enumeration method with a global optimality guarantee, which motivates an efficient iterative optimization method with a local optimality guarantee. We find that our method achieves superior performance on challenging multi-agent coordination tasks such as MACO.
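To make the setup concrete, the following is a minimal sketch (not the paper's implementation) of a non-linear coordination graph: pairwise payoff functions on graph edges are combined through a small LeakyReLU mixing network, and greedy joint-action selection is done by exhaustive enumeration. All sizes, weights, and payoff tables below are random placeholders standing in for learned, state-conditioned networks; the exponential cost of the enumeration step is what motivates the paper's iterative, locally optimal alternative.

import itertools
import numpy as np

# Hypothetical sizes for illustration only.
n_agents, n_actions = 3, 4
edges = [(0, 1), (1, 2), (0, 2)]          # coordination-graph edges
rng = np.random.default_rng(0)

# Pairwise payoff tables q_ij(a_i, a_j), one per edge (random stand-ins for
# learned payoff networks).
payoffs = {e: rng.standard_normal((n_actions, n_actions)) for e in edges}

# A one-hidden-layer LeakyReLU mixing network over the local payoffs
# (random stand-in for a learned, state-conditioned mixer).
W1 = np.abs(rng.standard_normal((8, len(edges))))
b1 = rng.standard_normal(8)
W2 = np.abs(rng.standard_normal((1, 8)))
b2 = rng.standard_normal(1)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def q_tot(joint_action):
    """Non-linear mixing of edge payoffs for one joint action."""
    local = np.array([payoffs[(i, j)][joint_action[i], joint_action[j]]
                      for (i, j) in edges])
    hidden = leaky_relu(W1 @ local + b1)
    return float(W2 @ hidden + b2)

# Greedy joint-action selection by exhaustive enumeration: globally optimal,
# but exponential in the number of agents, so impractical at scale.
best = max(itertools.product(range(n_actions), repeat=n_agents), key=q_tot)
print("greedy joint action:", best, "Q_tot:", q_tot(best))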