Safe exploration is crucial for the real-world application of reinforcement learning (RL). Previous works formulate the safe exploration problem as a Constrained Markov Decision Process (CMDP), in which policies are optimized under constraints. However, when encountering potential danger, humans tend to stop immediately and rarely learn to behave safely while remaining in danger. Motivated by human learning, we introduce a new approach to safe RL under the framework of the Early Terminated MDP (ET-MDP). We first define the ET-MDP as an unconstrained MDP whose optimal value function coincides with that of the corresponding CMDP. We then propose an off-policy algorithm based on context models to solve the ET-MDP, which in turn solves the corresponding CMDP with better asymptotic performance and improved learning efficiency. Experiments on various CMDP tasks show a substantial improvement over previous methods that directly solve the CMDP.
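As a minimal sketch of this reduction (the cost signal $c$, constraint threshold $d$, and termination time $\tau$ are notation introduced here for illustration; they are not spelled out in the abstract), the CMDP and its early-terminated counterpart can be written as
\[
\max_{\pi}\ \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{T}\gamma^{t}\, r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
\mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{T} c(s_t,a_t)\Big]\le d
\qquad\text{(CMDP)}
\]
\[
\max_{\pi}\ \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{\min(T,\,\tau)}\gamma^{t}\, r(s_t,a_t)\Big],
\qquad
\tau=\min\Big\{t:\ \textstyle\sum_{k=0}^{t} c(s_k,a_k)> d\Big\}
\qquad\text{(ET-MDP)}
\]
Terminating the episode at the first constraint violation forfeits all return obtainable after a violation, so an optimal policy has no incentive to violate the constraint; this is the intuition behind the two problems sharing the same optimal value function.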