DRL-VO: Learning to Navigate Through Crowded Dynamic Scenes Using Velocity Obstacles

This paper proposes a novel learning-based control policy with strong generalizability to new environments that enables a mobile robot to navigate autonomously through spaces filled with both static obstacles and dense crowds of pedestrians. The policy uses a unique combination of input data to generate the desired steering angle and forward velocity: a short history of lidar data, kinematic data about nearby pedestrians, and a sub-goal point. The policy is trained in a reinforcement learning setting using a reward function that contains a novel term based on velocity obstacles, which guides the robot to actively avoid pedestrians and move towards the goal. In a series of 3D simulated experiments with up to 55 pedestrians, this control policy achieves a better balance between collision avoidance and speed (i.e., a higher success rate and a faster average speed) than state-of-the-art model-based and learning-based policies, and it also generalizes better to different crowd sizes and unseen environments. An extensive series of hardware experiments demonstrates that the policy works directly in different real-world environments with different crowd sizes and zero retraining. Furthermore, a series of simulated and hardware experiments shows that the control policy also works in highly constrained static environments on a different robot platform without any additional training. Lastly, we summarize several important lessons that can be applied to other robot learning systems. Multimedia demonstrations are available at https://www.youtube.com/watch?v=eCcNYSbgCv8&list=PLouWbAcP4zIvPgaARrV223lf2eiSR-eSS.
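The abstract does not state the reward formula, so the following is only an illustrative sketch of how a velocity-obstacle term might be computed, not the authors' implementation. It assumes a 2D robot frame with the robot at the origin, circular robot and pedestrian footprints, and pedestrians given as (position, velocity) pairs. The function names, `combined_radius`, `penalty`, and `gain` are all hypothetical.

```python
import math


def in_velocity_obstacle(rel_pos, rel_vel, combined_radius):
    """Return True if the relative velocity points into the collision cone.

    rel_pos: pedestrian position relative to the robot (x, y).
    rel_vel: robot velocity minus pedestrian velocity (x, y).
    combined_radius: robot radius plus pedestrian radius (assumed value).
    """
    px, py = rel_pos
    vx, vy = rel_vel
    dist = math.hypot(px, py)
    if dist <= combined_radius:
        return True  # footprints already overlap
    speed = math.hypot(vx, vy)
    if speed == 0.0:
        return False  # no relative motion, so the gap never closes
    # Half-angle of the cone tangent to the combined-radius disc.
    half_angle = math.asin(combined_radius / dist)
    cos_between = (px * vx + py * vy) / (dist * speed)
    cos_between = max(-1.0, min(1.0, cos_between))
    return math.acos(cos_between) <= half_angle


def vo_reward(robot_vel, pedestrians, goal_dir,
              combined_radius=0.6, penalty=-1.0, gain=0.5):
    """Hypothetical reward term: penalize velocities inside any velocity
    obstacle, otherwise reward alignment with the goal direction."""
    for ped_pos, ped_vel in pedestrians:
        rel_vel = (robot_vel[0] - ped_vel[0], robot_vel[1] - ped_vel[1])
        if in_velocity_obstacle(ped_pos, rel_vel, combined_radius):
            return penalty
    speed = math.hypot(robot_vel[0], robot_vel[1])
    goal_norm = math.hypot(goal_dir[0], goal_dir[1])
    if speed == 0.0 or goal_norm == 0.0:
        return 0.0
    cos_goal = (robot_vel[0] * goal_dir[0]
                + robot_vel[1] * goal_dir[1]) / (speed * goal_norm)
    return gain * cos_goal


# A stationary pedestrian 2 m straight ahead of the robot.
crowd = [((2.0, 0.0), (0.0, 0.0))]
head_on = vo_reward((1.0, 0.0), crowd, goal_dir=(1.0, 0.0))
sidestep = vo_reward((0.0, 1.0), crowd, goal_dir=(0.0, 1.0))
```

In this sketch, driving straight at the pedestrian incurs the penalty, while a velocity outside the cone is scored only by how well it points toward the sub-goal. A trained policy would receive such a term alongside other reward components, which the abstract does not enumerate.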
