Cost functions are commonly employed in Safe Deep Reinforcement Learning (DRL). However, the cost is typically encoded as an indicator function due to the difficulty of quantifying the risk of policy decisions in the state space. Such an encoding requires the agent to visit numerous unsafe states to learn a cost-value function that drives the learning process toward safety, increasing the number of unsafe interactions and decreasing sample efficiency. In this paper, we investigate an alternative approach that uses domain knowledge to quantify the risk in the proximity of such states by defining a violation metric. This metric is computed by verifying task-level properties, shaped as input-output conditions, and is used as a penalty to bias the policy away from unsafe states without learning an additional value function. We investigate the benefits of using the violation metric in standard Safe DRL benchmarks and in robotic mapless navigation tasks. The navigation experiments bridge the gap between Safe DRL and robotics, introducing a framework that allows rapid testing on real robots. Our experiments show that policies trained with the violation penalty achieve higher performance than Safe DRL baselines and significantly reduce the number of visited unsafe states.
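A minimal sketch of how such a violation penalty could enter a training loop, assuming a hypothetical `violation` routine that checks task-level input-output properties near a state and returns a value in [0, 1]; the reward shaping below is illustrative only and not the paper's exact formulation.

```python
import numpy as np

def violation(state, properties):
    """Hypothetical violation metric: fraction of task-level
    input-output properties not satisfied in the proximity of `state`.
    Each property is a predicate over the state returning True if satisfied."""
    unsatisfied = sum(0 if prop(state) else 1 for prop in properties)
    return unsatisfied / max(len(properties), 1)

def shaped_reward(reward, state, properties, penalty_weight=1.0):
    """Bias the policy away from unsafe states by subtracting the
    violation penalty from the task reward (no cost-value function learned)."""
    return reward - penalty_weight * violation(state, properties)

# Toy mapless-navigation property (illustrative): keep every range
# reading above 0.2 m, where state[:8] are lidar-like distances.
properties = [lambda s: np.min(s[:8]) > 0.2]
state = np.random.uniform(0.1, 1.0, size=10)
r_shaped = shaped_reward(1.0, state, properties)
```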