Learning Cooperation and Online Planning Through Simulation and Graph Convolutional Network

The Multi-agent Markov Decision Process (MMDP) has been an effective way of modelling sequential decision making in multi-agent cooperative environments. A number of algorithms based on centralized and decentralized planning have been developed in this domain. However, dynamically changing environments, coupled with the exponential size of the state and joint action spaces, make it difficult for these algorithms to provide both efficiency and scalability. Recently, the centralized planning algorithm FV-MCTS-MP and the decentralized planning algorithm Alternate maximization with Behavioural Cloning (ABC) have achieved notable performance in solving MMDPs. However, they are not capable of adapting to dynamically changing environments and accounting for the lack of communication among agents, respectively. Against this background, we introduce a simulation-based online planning algorithm, which we call SiCLOP, for multi-agent cooperative environments. Specifically, SiCLOP tailors Monte Carlo Tree Search (MCTS) and uses a Coordination Graph (CG) and a Graph Convolutional Network (GCN) to learn cooperation and provide a real-time solution to an MMDP problem. It also improves scalability through effective pruning of the action space. Additionally, unlike FV-MCTS-MP and ABC, SiCLOP supports transfer learning, which enables learned agents to operate in different environments. We also provide a theoretical discussion of the convergence property of our algorithm in multi-agent settings. Finally, our extensive empirical results show that SiCLOP significantly outperforms state-of-the-art online planning algorithms.
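To make the scalability point concrete, the following minimal sketch shows how the joint action space grows exponentially with the number of agents and how per-agent pruning shrinks it. It is an illustration only: the top-k rule, the `prune_top_k` helper, and the action scores are hypothetical assumptions, not the pruning procedure SiCLOP actually uses.

```python
from itertools import product


def joint_action_count(num_agents: int, actions_per_agent: int) -> int:
    """Size of the unpruned joint action space: |A| ** n."""
    return actions_per_agent ** num_agents


def prune_top_k(scores, k):
    """Keep the k highest-scoring actions for each agent.

    `scores` is a list with one dict per agent mapping action -> score.
    The scores are hypothetical; in practice they might come from a
    learned model, but that is an assumption of this sketch.
    """
    return [sorted(s, key=s.get, reverse=True)[:k] for s in scores]


# Hypothetical per-agent action scores for three agents (assumed values).
scores = [
    {"up": 0.1, "down": 0.7, "left": 0.2, "right": 0.9, "stay": 0.3},
    {"up": 0.8, "down": 0.1, "left": 0.6, "right": 0.2, "stay": 0.4},
    {"up": 0.3, "down": 0.2, "left": 0.1, "right": 0.5, "stay": 0.9},
]

pruned = prune_top_k(scores, k=2)
full_size = joint_action_count(num_agents=3, actions_per_agent=5)
pruned_size = len(list(product(*pruned)))

print(full_size, pruned_size)
```

With three agents and five actions each, the unpruned joint space has 5^3 = 125 joint actions, while keeping two actions per agent leaves 2^3 = 8. The gap widens quickly as agents are added, which is why a tree search over joint actions benefits from pruning.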
