The safety constraints commonly used by existing safe reinforcement learning (RL) methods are defined only in expectation over initial states, so individual states may still be unsafe, which is unacceptable for real-world safety-critical tasks. In this paper, we introduce the feasible actor-critic (FAC) algorithm, the first model-free constrained RL method that considers statewise safety, i.e., safety for each initial state. We observe that some states are inherently unsafe no matter what policy is chosen, while for other states there exist policies that ensure safety; we call such states and policies feasible. By constructing a statewise Lagrange function that can be estimated from RL samples and adopting an additional neural network to approximate the statewise Lagrange multiplier, we obtain the optimal feasible policy, which ensures safety for every feasible state and acts as safely as possible for infeasible states. Furthermore, the trained multiplier network can indicate whether a given state is feasible through the statewise complementary slackness condition. We provide theoretical guarantees that FAC outperforms previous expectation-based constrained RL methods in terms of both constraint satisfaction and reward optimization. Experimental results on robot locomotion tasks and safe exploration tasks verify the safety enhancement and feasibility interpretation of the proposed method.
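To make the statewise Lagrangian construction concrete, the following is a minimal sketch (not the authors' implementation) of a multiplier network that outputs a non-negative lambda(s) per state, combined with a statewise Lagrangian of the form L = E_s[-Q_r(s, pi(s)) + lambda(s)(Q_c(s, pi(s)) - d)]. It assumes a PyTorch-style setup; names such as MultiplierNet, statewise_lagrangian, and cost_limit are illustrative assumptions rather than identifiers from the paper.

```python
# Sketch of a statewise Lagrange multiplier network and Lagrangian estimate.
# The policy minimizes the Lagrangian while the multiplier net maximizes it,
# so lambda(s) grows only on states whose cost value exceeds the limit d.
# All module and variable names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiplierNet(nn.Module):
    """Maps a state s to a non-negative statewise multiplier lambda(s)."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softplus keeps lambda(s) >= 0, as a Lagrange multiplier requires.
        return F.softplus(self.body(state)).squeeze(-1)


def statewise_lagrangian(q_reward, q_cost, lam, cost_limit):
    """Sample average of -Q_r(s, a) + lambda(s) * (Q_c(s, a) - d)."""
    return (-q_reward + lam * (q_cost - cost_limit)).mean()


if __name__ == "__main__":
    # Toy usage with random tensors standing in for critic outputs.
    states = torch.randn(32, 8)        # batch of sampled states
    q_reward = torch.randn(32)         # Q_r(s, pi(s)) from the reward critic
    q_cost = torch.rand(32)            # Q_c(s, pi(s)) from the cost critic

    lam_net = MultiplierNet(state_dim=8)
    lam = lam_net(states)

    loss_actor = statewise_lagrangian(q_reward, q_cost, lam, cost_limit=0.5)
    # The multiplier net ascends the same objective (gradient ascent on lambda),
    # pushing lambda(s) up wherever the statewise constraint Q_c <= d is violated.
    loss_multiplier = -loss_actor
    print(loss_actor.item(), loss_multiplier.item())
```

After training, a near-zero lambda(s) on a state with an active constraint, or a large lambda(s) on a state whose constraint cannot be satisfied, is what the abstract refers to as reading feasibility off the statewise complementary slackness condition.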