Safety comes first in many real-world applications involving autonomous agents. Despite the large number of reinforcement learning (RL) methods focusing on safety-critical tasks, there is still a lack of high-quality evaluation of algorithms that adhere to safety constraints at each decision step under complex and unknown dynamics. In this paper, we revisit prior work in this scope from the perspective of state-wise safe RL and categorize it into projection-based, recovery-based, and optimization-based approaches. Furthermore, we propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection. This novel technique explicitly enforces hard constraints via a deep unrolling architecture and enjoys structural advantages in navigating the trade-off between reward improvement and constraint satisfaction. To facilitate further research in this area, we reproduce the related algorithms in a unified pipeline and incorporate them into SafeRL-Kit, a toolkit that provides off-the-shelf interfaces and evaluation utilities for safety-critical tasks. We then perform a comparative study of the involved algorithms on six benchmarks ranging from robotic control to autonomous driving. The empirical results provide insight into their applicability and robustness in learning zero-cost-return policies without task-dependent handcrafting. The project page is available at https://sites.google.com/view/saferlkit.
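For context (the notation below is our own shorthand and not taken from the paper; $c$ denotes a cost function, $\epsilon$ a per-step tolerance, $d$ a cumulative budget, and $\gamma$ the discount factor), the state-wise setting can be contrasted with the standard constrained-MDP formulation: rather than bounding an expected discounted cumulative cost, it requires the instantaneous cost to satisfy a hard threshold at every decision step, which is the kind of constraint USL is designed to enforce:
\[
\underbrace{\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, c(s_t,a_t)\Big] \le d}_{\text{cumulative (CMDP) constraint}}
\qquad\text{vs.}\qquad
\underbrace{c(s_t,a_t) \le \epsilon,\quad \forall\, t \ge 0}_{\text{state-wise (per-step) constraint}}
\]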