We present a self-improving neural tree expansion method for multi-robot online planning in non-cooperative environments, where each robot tries to maximize its cumulative reward while interacting with other self-interested robots. Our algorithm adapts the centralized, perfect-information, discrete-action-space method from AlphaZero to a decentralized, partial-information, continuous-action-space setting for multi-robot applications. Our method has three interacting components: (i) a centralized, perfect-information 'expert' Monte Carlo Tree Search (MCTS) with large computation resources that provides expert demonstrations, (ii) a decentralized, partial-information 'learner' MCTS with small computation resources that runs in real time and provides self-play examples, and (iii) policy and value neural networks that are trained with the expert demonstrations and bias both the expert and the learner tree growth. Our numerical experiments demonstrate that neural expansion generates compact search trees with better solution quality and 20 times less computational expense compared to MCTS without neural expansion. The resulting policies are dynamically sophisticated, demonstrate coordination between robots, and play the Reach-Target-Avoid differential game significantly better than the state-of-the-art control-theoretic baseline for multi-robot, double-integrator systems. Our hardware experiments on an aerial swarm demonstrate the computational advantage of neural tree expansion, enabling online planning at 20 Hz with effective policies in complex scenarios.
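As a rough illustration of what it means for policy and value networks to bias tree growth, the following minimal sketch runs a single-agent MCTS on a one-dimensional double integrator. It is not the authors' implementation: `policy_stub` and `value_stub` are hypothetical hand-written stand-ins for the trained networks, the reward function, the constants, and the use of progressive widening to handle the continuous action space are all assumptions made for this example, and the multi-robot, partial-information aspects are omitted.

```python
import math
import random


class Node:
    """Search-tree node holding a state and visit statistics."""

    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action
        self.children = []
        self.visits = 0
        self.total_value = 0.0


def step(state, accel, dt=0.1):
    """Double-integrator dynamics; state is (position, velocity)."""
    pos, vel = state
    return (pos + vel * dt, vel + accel * dt)


def reward(state, goal=1.0):
    """Hypothetical reward in (0, 1]: larger when closer to the goal."""
    return math.exp(-abs(state[0] - goal))


def policy_stub(state, rng, goal=1.0):
    """Stand-in for a policy network: noisy PD-like acceleration in [-1, 1]."""
    pos, vel = state
    mean = max(-1.0, min(1.0, 2.0 * (goal - pos) - vel))
    return max(-1.0, min(1.0, rng.gauss(mean, 0.3)))


def value_stub(state):
    """Stand-in for a value network: here simply the immediate reward."""
    return reward(state)


def select(node, c_pw=1.0, alpha=0.5, c_ucb=1.4):
    """Descend by UCB while progressive widening forbids adding a child."""
    while node.children and len(node.children) >= c_pw * node.visits ** alpha:
        parent = node
        node = max(
            parent.children,
            key=lambda ch: ch.total_value / ch.visits
            + c_ucb * math.sqrt(math.log(parent.visits) / ch.visits),
        )
    return node


def search(root_state, n_iters=200, seed=0):
    """Grow a tree whose expansion and leaf evaluation use the two stubs."""
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_iters):
        node = select(root)
        # Neural expansion (assumed form): the new action is sampled from the policy.
        action = policy_stub(node.state, rng)
        child = Node(step(node.state, action), parent=node, action=action)
        node.children.append(child)
        # Leaf evaluation by the value stub replaces a random rollout.
        value = value_stub(child.state)
        while child is not None:  # back up the estimate to the root
            child.visits += 1
            child.total_value += value
            child = child.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return best.action, root


if __name__ == "__main__":
    best_action, tree = search((0.0, 0.0))
    print("chosen acceleration:", best_action, "root children:", len(tree.children))
```

Replacing `policy_stub` with uniform sampling over the action interval would give a version of the search without learned expansion for comparison, although this toy problem is not meant to reproduce the reported numbers.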