Safety is essential for deploying Deep Reinforcement Learning (DRL) algorithms in real-world scenarios. Recently, verification approaches have been proposed to quantify the number of violations of a DRL policy over input-output relationships, called properties. However, such properties are hard-coded and require task-level knowledge, making their application intractable in challenging safety-critical tasks. To address this, we introduce the Collection and Refinement of Online Properties (CROP) framework to design properties at training time. CROP employs a cost signal to identify unsafe interactions and uses them to shape safety properties. We further propose a refinement strategy that combines properties modeling similar unsafe interactions. Our evaluation compares the benefits of computing the number of violations using standard hard-coded properties against those generated with CROP. We evaluate our approach in several robotic mapless navigation tasks and demonstrate that the violation metric computed with CROP achieves higher returns and fewer violations than previous Safe DRL approaches.