PRIMAL2: Pathfinding via Reinforcement and Imitation Multi-Agent Learning - Lifelong

Multi-agent path finding (MAPF) is an indispensable component of large-scale robot deployments in numerous domains ranging from airport management to warehouse automation. In particular, this work addresses lifelong MAPF (LMAPF), an online variant of the problem where agents are immediately assigned a new goal upon reaching their current one, in dense and highly structured environments typical of real-world warehouse operations. Effectively solving LMAPF in such environments requires expensive coordination between agents as well as frequent replanning, a daunting task for existing coupled and decoupled approaches alike. With the purpose of achieving considerable agent coordination without any compromise on reactivity and scalability, we introduce PRIMAL2, a distributed reinforcement learning framework for LMAPF where agents learn fully decentralized policies to reactively plan paths online in a partially observable world. We extend our previous work, which was effective in low-density, sparsely occupied worlds, to highly structured and constrained worlds by identifying behaviors and conventions which improve implicit agent coordination, and by enabling their learning through the construction of a novel local agent observation and various training aids. We present extensive results of PRIMAL2 in both MAPF and LMAPF environments and compare its performance to state-of-the-art planners in terms of makespan and throughput. We show that PRIMAL2 significantly surpasses our previous work and performs comparably to these baselines, while allowing real-time replanning and scaling up to 2048 agents.
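The lifelong setting described above can be illustrated with a minimal Python sketch. It is not PRIMAL2's learned policy and it performs no collision avoidance: agents simply walk greedily toward their goal on an empty grid. Its only purpose is to show the LMAPF loop, in which an agent is handed a fresh goal as soon as it reaches its current one. All names here (`step_toward`, `sample_goal`, `run_lifelong`) are hypothetical, and throughput is assumed to mean goals reached per timestep.

```python
import random


def step_toward(pos, goal):
    """Move one cell toward the goal, x first and then y (greedy, ignores other agents)."""
    (x, y), (gx, gy) = pos, goal
    if x != gx:
        return (x + (1 if gx > x else -1), y)
    if y != gy:
        return (x, y + (1 if gy > y else -1))
    return pos


def sample_goal(rng, size, exclude):
    """Draw a random cell of a size x size grid that differs from `exclude`."""
    while True:
        cell = (rng.randrange(size), rng.randrange(size))
        if cell != exclude:
            return cell


def run_lifelong(num_agents=4, size=8, horizon=100, seed=0):
    """Run the toy lifelong loop and return (goals_reached, throughput)."""
    rng = random.Random(seed)
    positions = [(rng.randrange(size), rng.randrange(size)) for _ in range(num_agents)]
    goals = [sample_goal(rng, size, p) for p in positions]
    goals_reached = 0
    for _ in range(horizon):
        for i in range(num_agents):
            positions[i] = step_toward(positions[i], goals[i])
            if positions[i] == goals[i]:
                # Lifelong rule: assign a new goal immediately on arrival.
                goals_reached += 1
                goals[i] = sample_goal(rng, size, positions[i])
    # Assumed definition: throughput = goals reached per timestep.
    return goals_reached, goals_reached / horizon


if __name__ == "__main__":
    reached, throughput = run_lifelong()
    print(f"goals reached: {reached}, throughput: {throughput:.2f} goals/step")
```

In a one-shot MAPF instance the loop would instead stop once every agent sits on its single goal, and the number of elapsed timesteps would be the makespan.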
