We consider the problem of cooperative exploration, in which multiple robots must jointly explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a popular paradigm for this problem. However, existing MARL-based methods use the number of decision steps as the metric for exploration efficiency, assuming that all agents act fully synchronously: every agent produces an action simultaneously, and every action is executed instantaneously at each time step. Despite its mathematical simplicity, this synchronous MARL formulation can be problematic for real-world robotic applications. Different robots typically take slightly different wall-clock times to complete an atomic action, and a robot may even drop out periodically because of hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. We therefore propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting, and additionally apply action-delay randomization so that the learned policy generalizes better to varying action delays in the real world. Moreover, each navigation agent is represented by a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient inter-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity vision-based environment, Habitat, achieving a 28% improvement in exploration efficiency.
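The cost of waiting for every robot can be illustrated with a small timing simulation. The sketch below is not the ACE implementation: the function names, the uniform delay distribution, and the delay range are all assumptions made for illustration. It compares the wall-clock time of a synchronous schedule, where each step lasts as long as the slowest agent's action, with an asynchronous schedule, where each agent starts its next action as soon as its own previous action finishes. Sampling a random duration per action stands in, loosely, for the idea of action-delay randomization.

```python
import heapq
import random


def sample_delays(num_agents, num_actions, low=1.0, high=3.0, seed=0):
    """Sample a random duration for every action of every agent.

    The uniform distribution and the [low, high] range are hypothetical;
    they only mimic robots whose atomic actions take varying time.
    """
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(num_actions)]
            for _ in range(num_agents)]


def synchronous_time(delays):
    """Total wall-clock time when every step waits for the slowest agent."""
    num_actions = len(delays[0])
    return sum(max(agent[t] for agent in delays) for t in range(num_actions))


def asynchronous_time(delays):
    """Total wall-clock time when agents act as soon as they are ready."""
    # Event queue of (finish_time, agent_id, action_index).
    events = [(agent[0], i, 0) for i, agent in enumerate(delays)]
    heapq.heapify(events)
    clock = 0.0
    while events:
        clock, i, t = heapq.heappop(events)
        # In an asynchronous MARL loop, only agent i would query its
        # policy here; the other agents keep executing their actions.
        if t + 1 < len(delays[i]):
            heapq.heappush(events, (clock + delays[i][t + 1], i, t + 1))
    return clock


if __name__ == "__main__":
    delays = sample_delays(num_agents=3, num_actions=20)
    print("synchronous :", synchronous_time(delays))
    print("asynchronous:", asynchronous_time(delays))
```

For two agents with action durations `[1.0, 3.0]` and `[3.0, 1.0]`, the synchronous schedule takes 6.0 time units while the asynchronous one takes 4.0, since neither agent idles while the other finishes. The asynchronous time can never exceed the synchronous time for the same durations, because the slowest agent's total is at most the sum of per-step maxima.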