PRIMAL2: Pathfinding via Reinforcement and Imitation Multi-Agent Learning - Lifelong

From arXiv. © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Multi-agent path finding (MAPF) is an indispensable component of large-scale robot deployments in numerous domains ranging from airport management to warehouse automation. In particular, this work addresses lifelong MAPF (LMAPF) - an online variant of the problem where agents are immediately assigned a new goal upon reaching their current one - in dense and highly structured environments, typical of real-world warehouse operations. Effectively solving LMAPF in such environments requires expensive coordination between agents as well as frequent replanning abilities, a daunting task for existing coupled and decoupled approaches alike. With the purpose of achieving considerable agent coordination without any compromise on reactivity and scalability, we introduce PRIMAL2, a distributed reinforcement learning framework for LMAPF where agents learn fully decentralized policies to reactively plan paths online in a partially observable world. We extend our previous work, which was effective in low-density sparsely occupied worlds, to highly structured and constrained worlds by identifying behaviors and conventions which improve implicit agent coordination, and enable their learning through the construction of a novel local agent observation and various training aids. We present extensive results of PRIMAL2 in both MAPF and LMAPF environments and compare its performance to state-of-the-art planners in terms of makespan and throughput. We show that PRIMAL2 significantly surpasses our previous work and performs comparably to these baselines, while allowing real-time re-planning and scaling up to 2048 agents.
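
The lifelong setting and the two metrics named in the abstract can be illustrated with a small toy sketch. This is not PRIMAL2 and involves no learning: it assumes the common definitions of makespan (timesteps until the last agent reaches its goal in one-shot MAPF) and throughput (average number of goals reached per timestep in LMAPF), and it ignores collisions between agents entirely, which is precisely the hard part a real planner must handle. All function names (`makespan`, `throughput`, `step_toward`, `run_lifelong`) are hypothetical and chosen for illustration only.

```python
from itertools import cycle


def makespan(paths):
    """One-shot MAPF: timesteps until the last agent reaches its goal.

    Each path is a list of cells including the start cell.
    """
    return max(len(path) - 1 for path in paths)


def throughput(goals_reached_per_step):
    """LMAPF: average number of goals reached per timestep."""
    return sum(goals_reached_per_step) / len(goals_reached_per_step)


def step_toward(pos, goal):
    """Move one cell toward the goal (x first, then y).

    Assumption: an empty grid with no obstacles and no collision checks.
    """
    x, y = pos
    gx, gy = goal
    if x != gx:
        return (x + (1 if gx > x else -1), y)
    if y != gy:
        return (x, y + (1 if gy > y else -1))
    return pos


def run_lifelong(starts, goal_stream, horizon):
    """Simulate lifelong goal reassignment for `horizon` timesteps.

    Returns a list with the number of goals reached at each timestep.
    """
    positions = list(starts)
    goals = [next(goal_stream) for _ in positions]
    reached = []
    for _ in range(horizon):
        count = 0
        for i, pos in enumerate(positions):
            positions[i] = step_toward(pos, goals[i])
            if positions[i] == goals[i]:
                count += 1
                # Lifelong variant: a new goal is assigned immediately.
                goals[i] = next(goal_stream)
        reached.append(count)
    return reached


if __name__ == "__main__":
    # Hypothetical goal sequence shared by two agents.
    stream = cycle([(2, 0), (3, 1), (0, 0), (3, 3)])
    per_step = run_lifelong([(0, 0), (3, 3)], stream, horizon=8)
    print("goals reached per step:", per_step)
    print("throughput:", throughput(per_step))
```

In the one-shot problem the episode ends once every agent has arrived, so makespan is the natural measure; in the lifelong problem the episode has no natural end, which is why throughput over a fixed horizon is reported instead.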
