We consider the problem of cooperative exploration, where multiple robots need to jointly explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming that all the agents act in a fully synchronous manner: at each time step, every agent produces an action simultaneously and every action is executed instantaneously. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. In practice, different robots may take slightly different wall-clock times to accomplish an atomic action, or may even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting, and additionally apply action-delay randomization so that the learned policy generalizes better to the varying action delays encountered in the real world. Moreover, each navigation agent is represented by a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient inter-agent communication through low-dimensional CNN features. We validate our approach in a grid-based scenario, where both simulation and real-robot results show that ACE reduces the actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity vision-based environment, Habitat, achieving a 28% improvement in exploration efficiency.
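To make the action-delay randomization idea concrete, below is a minimal sketch (not the authors' code) of how varying execution times could be simulated during training: each submitted atomic action is assumed to take a random number of wall-clock ticks to finish, so agents become ready for new decisions at different times. The class name `ActionDelayWrapper`, the `env.step(agent_id, action)` interface, and the `max_delay` parameter are all hypothetical illustration choices.

```python
# Hedged sketch of action-delay randomization for asynchronous multi-agent
# training; names and interfaces are assumptions, not the ACE implementation.
import random


class ActionDelayWrapper:
    """Hypothetical wrapper around a multi-agent grid environment.

    `env` is assumed to expose `step(agent_id, action)` for one agent;
    `max_delay` bounds the randomized per-action delay in ticks.
    """

    def __init__(self, env, num_agents, max_delay=3, seed=0):
        self.env = env
        self.num_agents = num_agents
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        # Ticks remaining until each agent finishes its current action.
        self.remaining = [0] * num_agents
        self.pending = [None] * num_agents

    def submit(self, agent_id, action):
        # Assign a random duration instead of executing instantaneously,
        # mimicking robots that take different wall-clock times per action.
        self.pending[agent_id] = action
        self.remaining[agent_id] = self.rng.randint(1, self.max_delay)

    def tick(self):
        """Advance wall-clock time by one tick; return agents ready to act."""
        ready = []
        for i in range(self.num_agents):
            if self.remaining[i] > 0:
                self.remaining[i] -= 1
                if self.remaining[i] == 0 and self.pending[i] is not None:
                    # The delayed action takes effect only now.
                    self.env.step(i, self.pending[i])
                    self.pending[i] = None
                    ready.append(i)
        return ready
```

Under this kind of randomization, a policy trained in simulation never relies on all teammates finishing their actions in lockstep, which is the property the asynchronous formulation targets.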