In real-world multi-robot systems, performing high-quality collaborative behaviors requires robots to asynchronously reason about high-level action selection over actions with varying time durations. Macro-Action Decentralized Partially Observable Markov Decision Processes (MacDec-POMDPs) provide a general framework for asynchronous decision making under uncertainty in fully cooperative multi-agent tasks. However, multi-agent deep reinforcement learning methods have so far been developed only for (synchronous) primitive-action problems. This paper proposes two Deep Q-Network (DQN) based methods for learning decentralized and centralized macro-action-value functions, each with a novel macro-action trajectory replay buffer. Evaluations on benchmark problems and a larger domain demonstrate the advantage of learning with macro-actions over primitive actions and the scalability of our approaches.