Safe exploration is a challenging and important problem in model-free reinforcement learning (RL). Often the safety cost is sparse and unknown, which unavoidably leads to constraint violations, a phenomenon that should ideally be avoided in safety-critical applications. We tackle this problem by augmenting the state space with a safety state, which is nonnegative if and only if the constraint is satisfied. The value of this state also measures the remaining distance to constraint violation, while its initial value indicates the available safety budget. This idea allows us to derive policies for scheduling the safety budget during training. We call our approach Simmer (Safe policy IMproveMEnt for RL) to reflect the careful nature of these schedules. We apply this idea to two safe RL problems: RL with constraints imposed on an average cost, and RL with constraints imposed on a cost with probability one. Our experiments suggest that simmering a safe algorithm can improve safety during training for both settings. We further show that Simmer can stabilize training and improve the performance of safe RL with average constraints.
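As a rough illustration of the safety-state augmentation idea described above, the sketch below wraps an environment so that its observation is extended with a remaining safety budget that starts at an initial value and is decremented by the per-step cost; it stays nonnegative exactly while the accumulated cost remains within the budget. The wrapper name, the step interface, and the budget handling are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of safety-state augmentation.
# Assumes an environment whose step() returns (obs, reward, cost, done, info);
# names and interface are assumptions for illustration only.
import numpy as np


class SafetyStateWrapper:
    def __init__(self, env, budget):
        self.env = env
        self.budget = float(budget)   # initial safety budget (schedulable during training)
        self.z = self.budget          # remaining budget, i.e. distance to constraint violation

    def reset(self):
        obs = self.env.reset()
        self.z = self.budget
        return np.append(obs, self.z)  # augment the observation with the safety state

    def step(self, action):
        obs, reward, cost, done, info = self.env.step(action)
        self.z -= cost                # nonnegative iff the constraint is still satisfied
        info["constraint_violated"] = self.z < 0.0
        return np.append(obs, self.z), reward, cost, done, info
```

A budget schedule in the spirit of Simmer could then be realized by varying the `budget` passed to this wrapper across training iterations, starting from a conservative value and increasing it as the policy improves.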