Accelerated Sim-to-Real Deep Reinforcement Learning: Learning Collision Avoidance from Human Player

This paper presents a sensor-level, mapless collision avoidance algorithm for mobile robots that maps raw sensor data to linear and angular velocities, allowing the robot to navigate an unknown environment without a map. An efficient training strategy is proposed that lets the robot learn from both human experience data and self-exploratory data. A game-format simulation framework is designed in which a human player tele-operates the mobile robot to a goal, and the human's actions are scored with the same reward function. Both human-player data and self-play data are sampled using the prioritized experience replay algorithm. The proposed algorithm and training strategy were evaluated in two experimental configurations: Environment 1, a simulated cluttered environment, and Environment 2, a simulated corridor environment. The proposed method reached the same level of reward using only 16% of the training steps required by the standard Deep Deterministic Policy Gradient (DDPG) method in Environment 1 and 20% of those required in Environment 2. In an evaluation of 20 random missions, the proposed method achieved zero collisions after less than 2 h and 2.5 h of training in the two Gazebo environments, respectively. The method also generated smoother trajectories than DDPG. The proposed method was also implemented on a real robot in a real-world environment for performance evaluation. The model trained in simulation can be applied directly to the real-world scenario without further fine-tuning, further demonstrating its higher robustness compared with DDPG. Video: https://youtu.be/BmwxevgsdGc Code: https://github.com/hanlinniu/turtlebot3_ddpg_collision_avoidance
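
The abstract states that human-player and self-play transitions are drawn with prioritized experience replay but gives no implementation details. The following is a minimal sketch of one common variant (proportional prioritization), not the authors' code. The class name `PrioritizedReplayBuffer`, the `source` tag, and the values of `alpha`, `beta`, and `eps` are all assumptions made for illustration.

```python
import numpy as np


class PrioritizedReplayBuffer:
    """Proportional prioritized replay over human and self-play transitions.

    Hypothetical sketch: names and hyperparameters are assumptions,
    not taken from the paper or its repository.
    """

    def __init__(self, capacity, alpha=0.6, eps=1e-3, seed=0):
        self.capacity = capacity
        self.alpha = alpha        # how strongly priorities skew sampling (assumed)
        self.eps = eps            # keeps every priority strictly positive
        self.data = []            # stored (transition, source) pairs
        self.priorities = []      # one priority per stored pair
        self.pos = 0              # next slot to overwrite once the buffer is full
        self.rng = np.random.default_rng(seed)

    def add(self, transition, source):
        """Store a transition tagged 'human' or 'self' in the same buffer."""
        # New items get the current maximum priority so they are likely
        # to be replayed at least once before their TD error is known.
        priority = max(self.priorities, default=1.0)
        item = (transition, source)
        if len(self.data) < self.capacity:
            self.data.append(item)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = item
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        """Draw indices with probability proportional to priority**alpha."""
        scaled = np.asarray(self.priorities, dtype=float) ** self.alpha
        probs = scaled / scaled.sum()
        idx = self.rng.choice(len(self.data), size=batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        """Set each sampled item's priority to |TD error| + eps."""
        for i, err in zip(idx, td_errors):
            self.priorities[int(i)] = abs(float(err)) + self.eps
```

In this sketch both data sources share one buffer, so a human demonstration and a self-exploration step compete on priority alone; whether the original work uses a single buffer or separate ones is not stated in the abstract.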
