Constrained reinforcement learning (CRL) has attracted significant interest recently, since satisfying safety constraints is critical for real-world problems. However, existing CRL methods that constrain discounted cumulative costs generally lack a rigorous definition and guarantee of safety. In safe control research, by contrast, safety is defined as persistently satisfying certain state constraints. Such persistent safety is possible only on a subset of the state space, called the feasible set, and a largest feasible set exists for a given environment. Recent studies that combine safe control with CRL through energy-based methods, such as control barrier functions (CBF) and safety indices (SI), rely on prior conservative estimates of feasible sets, which harms the performance of the learned policy. To address this problem, this paper proposes a reachability CRL (RCRL) method that uses reachability analysis to characterize the largest feasible set. We characterize the feasible set through the established self-consistency condition, so that a safety value function can be learned and used as a constraint in CRL. We also use multi-time-scale stochastic approximation theory to prove that the proposed algorithm converges to a local optimum at which the largest feasible set is guaranteed. Empirical results on benchmarks such as safe-control-gym and Safety-Gym validate the learned feasible set, the performance in terms of the optimality criterion, and the constraint satisfaction of RCRL in comparison with state-of-the-art CRL baselines.
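To make the notion of a largest feasible set concrete, the following is a minimal tabular sketch, not the RCRL algorithm itself. It assumes a hypothetical toy system (a cart with integer position and velocity approaching a wall), the sign convention that h(s) <= 0 means the state constraint holds, and the standard undiscounted reachability backup V(s) = max{h(s), min_a V(s')} as the self-consistency condition. The abstract does not spell out the exact form used in the paper, so every name below is illustrative.

```python
import itertools

WALL = 5                # positions 0..WALL; reaching WALL violates the constraint (assumed toy setup)
V_MAX = 2               # velocities 0..V_MAX
ACTIONS = (-1, 0, 1)    # change in velocity: brake, coast, accelerate


def h(state):
    """Constraint function: h <= 0 means the state constraint is satisfied."""
    position, _ = state
    return 1.0 if position >= WALL else -1.0


def step(state, action):
    """Deterministic dynamics: move with the current velocity, then update it."""
    position, velocity = state
    next_position = min(position + velocity, WALL)
    next_velocity = min(max(velocity + action, 0), V_MAX)
    return (next_position, next_velocity)


def safety_values():
    """Iterate V(s) = max(h(s), min_a V(step(s, a))) to a fixed point."""
    states = list(itertools.product(range(WALL + 1), range(V_MAX + 1)))
    values = {s: h(s) for s in states}  # start from the constraint itself
    while True:
        updated = {
            s: max(h(s), min(values[step(s, a)] for a in ACTIONS))
            for s in states
        }
        if updated == values:
            return values
        values = updated


values = safety_values()
# The largest feasible set: states from which some policy never violates h.
feasible = sorted(s for s, v in values.items() if v <= 0.0)
print(feasible)
```

In this toy system the state (4, 1) satisfies the constraint at the current step but is excluded from the feasible set, because the cart moves into the wall before braking takes effect. This is the gap between instantaneous constraint satisfaction and persistent safety that the abstract describes. RCRL learns the safety value function with function approximation rather than by tabular iteration.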