MarsExplorer: Exploration of Unknown Terrains via Deep Reinforcement Learning and Procedurally Generated Environments

This paper is an initial endeavor to bridge the gap between powerful Deep Reinforcement Learning methodologies and the problem of exploration/coverage of unknown terrains. Within this scope, MarsExplorer, an OpenAI Gym-compatible environment tailored to exploration/coverage of unknown areas, is presented. MarsExplorer translates the original robotics problem into a Reinforcement Learning setup that various off-the-shelf algorithms can tackle. Any learned policy can be applied directly to a robotic platform, without requiring an elaborate simulation model of the robot's dynamics for an additional learning/adaptation phase. One of its core features is the controllable multi-dimensional procedural generation of terrains, which is the key to producing policies with strong generalization capabilities. Four different state-of-the-art RL algorithms (A3C, PPO, Rainbow, and SAC) are trained on the MarsExplorer environment, and a proper evaluation of their results compared to average human-level performance is reported. In the follow-up experimental analysis, the effect of the multi-dimensional difficulty setting on the learning capabilities of the best-performing algorithm (PPO) is analyzed. A milestone result is the generation of an exploration policy that follows the Hilbert curve, without providing this information to the environment and without directly or indirectly rewarding Hilbert-curve-like trajectories. The experimental analysis concludes by comparing the results of the PPO-learned policy with frontier-based exploration on extended terrain sizes. The source code can be found at: https://github.com/dimikout3/GeneralExplorationPolicy.
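
To make the Reinforcement Learning formulation above concrete, the following is a minimal sketch of a Gym-style grid environment for exploration/coverage. It is not the MarsExplorer API: the class name `ToyExplorationEnv`, its parameters (`size`, `obstacle_prob`, `sensor_range`, `seed`), the four-action move set, and the reward (number of newly revealed cells) are all assumptions made for illustration. The sensor is also simplified to see through obstacles.

```python
import random

UNKNOWN, FREE, OBSTACLE = -1, 0, 1
# Hypothetical discrete action set: up, down, left, right.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


class ToyExplorationEnv:
    """Illustrative Gym-style environment (reset/step); not the MarsExplorer API."""

    def __init__(self, size=8, obstacle_prob=0.1, sensor_range=1, seed=None):
        self.size = size
        self.obstacle_prob = obstacle_prob  # difficulty knob: obstacle density
        self.sensor_range = sensor_range    # difficulty knob: sensing radius
        self.rng = random.Random(seed)

    def reset(self):
        n = self.size
        # Procedural generation: each cell independently becomes an obstacle.
        self.terrain = [
            [OBSTACLE if self.rng.random() < self.obstacle_prob else FREE
             for _ in range(n)]
            for _ in range(n)
        ]
        self.terrain[0][0] = FREE  # the start cell is always traversable
        self.pos = (0, 0)
        self.known = [[UNKNOWN] * n for _ in range(n)]
        self._sense()
        return self._obs()

    def _sense(self):
        """Reveal cells within sensor_range; return the number newly revealed."""
        r0, c0 = self.pos
        k = self.sensor_range
        newly_revealed = 0
        for r in range(max(0, r0 - k), min(self.size, r0 + k + 1)):
            for c in range(max(0, c0 - k), min(self.size, c0 + k + 1)):
                if self.known[r][c] == UNKNOWN:
                    self.known[r][c] = self.terrain[r][c]
                    newly_revealed += 1
        return newly_revealed

    def _obs(self):
        # Observation: a copy of the partially known map plus the agent position.
        return [row[:] for row in self.known], self.pos

    def step(self, action):
        dr, dc = MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        inside = 0 <= r < self.size and 0 <= c < self.size
        if inside and self.terrain[r][c] == FREE:
            self.pos = (r, c)  # blocked or out-of-bounds moves leave the agent in place
        reward = self._sense()  # assumed reward: newly discovered cells
        done = all(cell != UNKNOWN for row in self.known for cell in row)
        return self._obs(), reward, done, {}
```

A random-policy rollout would call `env.reset()` once and then `env.step(random.choice(list(MOVES)))` until `done` is true. Regenerating the terrain on every `reset()` is the property that, per the abstract, encourages policies to generalize rather than memorize a single map.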
