Safe reinforcement learning (RL) aims to solve an optimal control problem under safety constraints. Existing $\textit{direct}$ safe RL methods use the original constraint throughout the learning process. They either lack theoretical guarantees for the intermediate policies during iteration or suffer from infeasibility problems. To address these issues, we propose an $\textit{indirect}$ safe RL method called feasible policy iteration (FPI), which iteratively uses the feasible region of the previous policy to constrain the current policy. The feasible region is represented by a feasibility function called the constraint decay function (CDF). The core of FPI is a region-wise policy update rule called feasible policy improvement, which maximizes the return under the constraint of the CDF inside the feasible region and minimizes the CDF outside the feasible region. This update rule is always feasible and ensures that the feasible region expands monotonically and that the state-value function increases monotonically inside the feasible region. Using the feasible Bellman equation, we prove that FPI converges to the maximum feasible region and the optimal state-value function. Experiments on classic control tasks and Safety Gym show that our algorithms achieve fewer constraint violations and performance comparable to or higher than the baselines.
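As a schematic sketch of this region-wise update (using illustrative notation not fixed by the abstract: $v^{\pi}$ for the state-value function, $F^{\pi}$ for the CDF, and $X_{\pi_k}$ for the feasible region of the previous policy $\pi_k$; the specific constraint $F^{\pi}(s) \le F^{\pi_k}(s)$ is one plausible instantiation, not necessarily the exact form used in the paper), feasible policy improvement can be read as
\[
\pi_{k+1}(s) \in
\begin{cases}
\operatorname*{arg\,max}_{\pi}\ v^{\pi}(s) \quad \text{s.t.}\ F^{\pi}(s) \le F^{\pi_k}(s), & s \in X_{\pi_k}, \\
\operatorname*{arg\,min}_{\pi}\ F^{\pi}(s), & s \notin X_{\pi_k},
\end{cases}
\]
so that states already inside the feasible region improve the return subject to the CDF constraint, while states outside it are driven toward feasibility.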