This paper is an initial endeavor to bridge the gap between powerful Deep Reinforcement Learning methodologies and the problem of exploration/coverage of unknown terrains. Within this scope, MarsExplorer, an OpenAI-Gym-compatible environment tailored to the exploration/coverage of unknown areas, is presented. MarsExplorer translates the original robotics problem into a Reinforcement Learning setup that various off-the-shelf algorithms can tackle. Any learned policy can be applied straightforwardly to a robotic platform without requiring an elaborate simulation model of the robot's dynamics or an additional learning/adaptation phase. One of its core features is the controllable, multi-dimensional procedural generation of terrains, which is key to producing policies with strong generalization capabilities. Four state-of-the-art RL algorithms (A3C, PPO, Rainbow, and SAC) are trained on the MarsExplorer environment, and their results are evaluated against average human-level performance. In the follow-up experimental analysis, the effect of the multi-dimensional difficulty setting on the learning capabilities of the best-performing algorithm (PPO) is analyzed. A milestone result is the generation of an exploration policy that follows the Hilbert curve, without providing this information to the agent or rewarding, directly or indirectly, Hilbert-curve-like trajectories. The experimental analysis concludes by evaluating the PPO-learned policy side-by-side with frontier-based exploration strategies. A study of the performance curves reveals that the PPO-based policy is capable of adaptive-to-the-unknown-terrain sweeping without leaving expensive-to-revisit areas uncovered, underlining the capability of RL-based methodologies to tackle exploration tasks efficiently. The source code can be found at: https://github.com/dimikout3/MarsExplorer.
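Because MarsExplorer exposes the standard OpenAI-Gym interface, any off-the-shelf or learned policy interacts with it through the usual reset/step loop. The sketch below is a minimal illustration of that interface only: the environment id string and the random-action stand-in for a learned policy are assumptions for demonstration, not the package's documented API (see the repository for the actual registration and configuration).

    # Minimal Gym-style interaction sketch (illustrative; env id is assumed).
    import gym

    env = gym.make("MarsExplorer-v0")   # hypothetical id; check the repo for the registered name
    obs = env.reset()
    done = False
    episode_return = 0.0
    while not done:
        action = env.action_space.sample()          # stand-in for a learned policy, e.g. PPO
        obs, reward, done, info = env.step(action)  # classic Gym 4-tuple step API
        episode_return += reward
    print("episode return:", episode_return)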