Safety is a critical hurdle that limits the application of deep reinforcement learning (RL) to real-world control tasks. To this end, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, such constrained RL methods fail to achieve zero violation even when the cost limit is set to zero. This paper analyzes the reasons for this failure, and the analysis suggests that a properly designed cost function plays an important role in constrained RL. Inspired by this analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL achieve zero-violation performance. We validate the proposed method and the searched cost function on the safe RL benchmark Safety Gym. We compare augmented agents, which use our searched cost function to provide additive intrinsic costs on top of the extrinsic costs, against baseline agents that use the same policy learners but only the extrinsic costs. Results show that the converged policies with intrinsic costs achieve zero constraint violation in all environments while attaining performance comparable to the baselines.
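As a rough illustration of the cost augmentation described above, the sketch below shows how a per-step extrinsic cost reported by a Safety Gym-style environment could be combined with an additive intrinsic cost before being fed to a constrained RL learner. The `hazards_lidar` observation key and the quadratic proximity penalty are hypothetical stand-ins for illustration only, not the cost function found by AutoCost.

```python
# Minimal sketch (not the paper's implementation) of additive cost augmentation.
import numpy as np

def extrinsic_cost(info):
    # Safety Gym-style environments report a per-step safety cost in info["cost"].
    return float(info.get("cost", 0.0))

def intrinsic_cost(obs):
    # Hypothetical searched cost: penalize proximity to hazards, assuming the
    # observation dict exposes normalized lidar readings under "hazards_lidar".
    lidar = np.asarray(obs.get("hazards_lidar", [0.0]))
    return float(lidar.max() ** 2)

def augmented_cost(obs, info):
    # Total cost used by the augmented agent: extrinsic plus intrinsic.
    return extrinsic_cost(info) + intrinsic_cost(obs)
```

Under this sketch, the baseline agent would train on `extrinsic_cost` alone, while the augmented agent trains on `augmented_cost` with the same policy learner and cost limit.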