Due to the representational limitations of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they cannot ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function under LVD and MVD. Based on this expression, we draw a transition diagram in which each self-transition node (STN) is a possible convergence point. To ensure optimal consistency, the optimal node must be the unique STN. We therefore propose greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.
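The failure of optimal consistency under relative overgeneralization can be illustrated on a single-state matrix game of the kind mentioned in the abstract. Below is a minimal sketch (the payoff values and the least-squares fit under uniform exploration are illustrative assumptions standing in for Q-learning with linear value decomposition, not the paper's method): the individual greedy actions recovered from the fitted decomposition do not correspond to the joint action with the maximal true Q value.

```python
import numpy as np

# Illustrative 2-agent, 3-action matrix game exhibiting relative
# overgeneralization (payoff values are assumed for illustration).
# The optimal joint action is (0, 0) with payoff 8.
payoff = np.array([
    [  8.0, -12.0, -12.0],
    [-12.0,   0.0,   0.0],
    [-12.0,   0.0,   0.0],
])
n_actions = payoff.shape[0]

# Fit a linear value decomposition Q_jt(a1, a2) = Q_1(a1) + Q_2(a2)
# by least squares, assuming uniform exploration (every joint action
# visited equally often).
rows, targets = [], []
for a1 in range(n_actions):
    for a2 in range(n_actions):
        x = np.zeros(2 * n_actions)
        x[a1] = 1.0              # indicator for agent 1's action
        x[n_actions + a2] = 1.0  # indicator for agent 2's action
        rows.append(x)
        targets.append(payoff[a1, a2])
w, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
q1, q2 = w[:n_actions], w[n_actions:]

greedy = (int(np.argmax(q1)), int(np.argmax(q2)))
optimal = np.unravel_index(np.argmax(payoff), payoff.shape)
print("individual greedy joint action:", greedy)                    # not (0, 0)
print("true optimal joint action:     ", tuple(map(int, optimal)))  # (0, 0)
```

Because the decomposed values average over the other agent's exploratory actions, the heavily punished neighborhood of the optimal joint action drags its individual utilities below those of the safe but suboptimal actions, which is exactly the optimal-consistency violation the paper targets.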