Many real-world tasks, such as robot coordination, can be naturally modelled as multi-agent cooperative systems with sparse rewards. This paper focuses on learning decentralized policies for such tasks from sub-optimal demonstrations. To learn multi-agent cooperation effectively and to tackle the sub-optimality of the demonstrations, a self-improving learning method is proposed: on the one hand, the centralized state-action values are initialized from the demonstrations and updated by the learned decentralized policies, mitigating the sub-optimality; on the other hand, Nash equilibria are computed from the current state-action values and used as a guide for learning the policies. The proposed method is evaluated on combat scenarios in RTS games, which require a high level of multi-agent cooperation. Extensive experimental results on various combat scenarios demonstrate that the proposed method learns multi-agent cooperation effectively and significantly outperforms state-of-the-art demonstration-based approaches.
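The self-improving loop described above can be illustrated with a minimal sketch: a shared state-action value table is seeded from sub-optimal demonstrations, a cooperative Nash equilibrium (the joint action maximizing the shared value) is used as the policy target, and the table is then refreshed with the returns the learned policies actually obtain. The toy payoff matrix, exploration scheme, and all names below are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the self-improving loop on a 2-agent cooperative matrix game.
# The payoff matrix, exploration rate, and variable names are illustrative assumptions.
import itertools
import numpy as np

n_agents, n_actions = 2, 3
joint_actions = list(itertools.product(range(n_actions), repeat=n_agents))

# True (unknown) cooperative payoff: both agents share the same reward.
true_payoff = {ja: float(ja[0] == ja[1] == 2) + 0.2 * (ja[0] == ja[1])
               for ja in joint_actions}

# Centralized joint-action values initialized from sub-optimal demonstrations:
# the demonstrator only ever coordinated on action 0, so only (0, 0) is seeded.
q = {ja: (0.2 if ja == (0, 0) else 0.0) for ja in joint_actions}

def cooperative_nash(values):
    """In a fully cooperative game, the joint action maximizing the shared
    value is a pure-strategy Nash equilibrium; it serves as the policy target."""
    return max(values, key=values.get)

alpha, epsilon, episodes = 0.5, 0.5, 1000
rng = np.random.default_rng(0)
for _ in range(episodes):
    target = cooperative_nash(q)
    # Decentralized execution: each agent plays its component of the equilibrium
    # joint action, with epsilon-greedy exploration.
    joint = tuple(a if rng.random() > epsilon else int(rng.integers(n_actions))
                  for a in target)
    reward = true_payoff[joint]
    # Self-improvement: refresh the centralized value with the policies' own
    # experience, gradually washing out the sub-optimal demonstration estimate.
    q[joint] += alpha * (reward - q[joint])

print("learned equilibrium joint action:", cooperative_nash(q))  # converges to (2, 2)
```

In this sketch the demonstration alone would lock the agents onto the inferior joint action (0, 0); updating the centralized values with the policies' own experience lets the Nash target shift to the better coordinated action, which is the self-improving behaviour the abstract describes.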