Value factorisation has proven to be a very useful technique in multi-agent reinforcement learning (MARL), but the underlying mechanism is not yet fully understood. This paper explores a theoretical basis for value factorisation. We generalise the Shapley value from coalitional game theory to a Markov convex game (MCG) and use it to guide value factorisation in MARL. We show that the generalised Shapley value possesses several desirable features: (1) accurate estimation of the maximum global value, (2) fairness in the factorisation of the global value, and (3) sensitivity to dummy agents. The proposed theory yields a new learning algorithm called Shapley Q-learning (SHAQ), which inherits the important merits of ordinary Q-learning but extends them to MARL. In comparison with prior art, SHAQ rests on a much weaker assumption (the MCG) that is more compatible with real-world problems, yet offers superior explainability and performance in many cases. We demonstrate SHAQ and verify the theoretical claims on Predator-Prey and the StarCraft Multi-Agent Challenge (SMAC).
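For reference, the classical Shapley value in coalitional game theory (the quantity that the paper generalises to the MCG setting; the Markov extension itself is not reproduced here) assigns each agent $i$ the weighted average of its marginal contributions over all coalitions $S \subseteq N \setminus \{i\}$, where $N$ is the agent set and $v$ the characteristic function:
$$
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr).
$$
Its efficiency property, $\sum_{i \in N} \phi_i(v) = v(N)$, together with fairness and the dummy-agent axiom, is what makes it a natural guide for factorising a global value into per-agent contributions, mirroring features (1)-(3) claimed above.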