Due to the limited representational capacity of the joint Q-value function, multi-agent reinforcement learning (MARL) methods with linear or monotonic value decomposition suffer from relative overgeneralization. As a result, they cannot guarantee optimal coordination. Existing methods address relative overgeneralization by pursuing complete expressiveness or learning a bias, which is insufficient to solve the problem. In this paper, we propose optimal consistency, a criterion for evaluating the optimality of coordination. To achieve optimal consistency, we introduce the True-Global-Max (TGM) principle for linear and monotonic value decomposition, where the TGM principle holds when the optimal stable point is the unique stable point. Accordingly, we propose greedy-based value representation (GVR), which ensures the optimal stable point via inferior target shaping and eliminates non-optimal stable points via superior experience replay. Theoretical proofs and empirical results demonstrate that our method guarantees optimal consistency under sufficient exploration. In experiments on various benchmarks, GVR significantly outperforms state-of-the-art baselines.