Safe reinforcement learning (RL), which seeks policies that satisfy constraints, is a promising route toward broader safety-critical applications of RL in real-world problems such as robotics. Among safe RL approaches, model-based methods further reduce training-time violations thanks to their high sample efficiency. However, the lack of safety robustness against model uncertainties remains an issue in safe model-based RL, especially for training-time safety. In this paper, we propose a distributional reachability certificate (DRC) and its Bellman equation to address model uncertainties and characterize robust, persistently safe states. Building on the DRC, we construct a safe RL framework that resolves the constraints it imposes together with the corresponding shield policy. We also devise a line search method that maintains safety while reaching higher returns when leveraging the shield policy. Comprehensive experiments on classical benchmarks such as constrained tracking and navigation show that the proposed algorithm achieves comparable returns with far fewer constraint violations during training.
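For intuition only, a scalar (non-distributional) reachability certificate of the kind common in the safe-RL literature can be sketched as follows; this is an illustrative assumption using a constraint-violation function $h$ and a deterministic model $f$, not the paper's distributional definition, which accounts for model uncertainty.

% Illustrative sketch (assumption): h(s) > 0 marks a constraint violation,
% and the model rolls out as s_{t+1} = f(s_t, \pi(s_t)) starting from s_0 = s.
\begin{align}
  V_h^{\pi}(s) &:= \max_{t \ge 0} \; h(s_t), \\
  V_h^{\pi}(s) &= \max\bigl\{\, h(s),\; V_h^{\pi}\bigl(f(s, \pi(s))\bigr) \,\bigr\},
\end{align}
% so a state s is persistently safe under \pi whenever V_h^{\pi}(s) \le 0.

The DRC proposed in the paper generalizes such a certificate to a distributional form; its exact definition and Bellman equation are given in the body of the paper.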