Deep reinforcement learning in continuous domains focuses on learning control policies that map states to distributions over actions, ideally concentrating on the optimal choice at each step. In multi-agent navigation problems, the optimal actions depend heavily on the density of agents, whose interaction patterns grow exponentially with that density, making it hard for learning-based methods to generalize. We propose to switch the learning objective from predicting the optimal actions to predicting sets of admissible actions, using what we call control admissibility models (CAMs), so that the models can be easily composed and used for online inference with an arbitrary number of agents. We design CAMs using graph neural networks and develop training methods that optimize them in the standard model-free setting, with the additional benefit of eliminating the reward engineering typically required to balance collision-avoidance and goal-reaching requirements. We evaluate the proposed approach in multi-agent navigation environments and show that CAMs can be trained in environments with only a few agents and easily composed for deployment in dense environments with hundreds of agents, outperforming state-of-the-art methods.
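To make the composition idea concrete, below is a minimal sketch in PyTorch, not the authors' implementation. Everything here is an assumption for illustration: the 4-D relative state, the 2-D action, a fixed discretized candidate-action set, a single permutation-invariant aggregation layer standing in for the full graph neural network, and the names CAM and admissible_actions, which are hypothetical.

    # Minimal CAM sketch (illustrative only): a learned model scores each
    # candidate action as admissible (score > 0) for one ego agent, given
    # the relative states of its neighbors.
    import torch
    import torch.nn as nn

    class CAM(nn.Module):
        def __init__(self, state_dim=4, action_dim=2, hidden=64):
            super().__init__()
            self.msg = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
            self.head = nn.Sequential(nn.Linear(hidden + action_dim, hidden),
                                      nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, rel_neighbors, candidate_actions):
            # rel_neighbors: (num_neighbors, state_dim) relative neighbor states
            # candidate_actions: (num_candidates, action_dim)
            h = self.msg(rel_neighbors).sum(dim=0)   # permutation-invariant pooling
            h = h.expand(candidate_actions.shape[0], -1)
            return self.head(torch.cat([h, candidate_actions], dim=-1)).squeeze(-1)

    def admissible_actions(cam, rel_neighbor_sets, candidates):
        # Composition for dense scenes: keep an action only if every local
        # neighborhood model admits it (intersection of admissible sets).
        keep = torch.ones(candidates.shape[0], dtype=torch.bool)
        for rel in rel_neighbor_sets:
            keep &= cam(rel, candidates) > 0.0
        return candidates[keep]

At inference time, one would then select any action from the returned set, e.g. the one making the most progress toward the goal; because admissibility is predicted per local neighborhood and intersected, a model trained with only a few agents can be applied unchanged to scenes with many more.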