This paper considers the problem of learning safe policies in the context of reinforcement learning (RL). In particular, we consider the notion of probabilistic safety: that is, we aim to design policies that keep the state of the system within a safe set with high probability. This notion differs from the cumulative constraints often considered in the literature. The challenge in working with probabilistic safety constraints is the lack of expressions for their gradients; indeed, policy optimization algorithms rely on gradients of both the objective function and the constraints. To the best of our knowledge, this work is the first to provide explicit gradient expressions for probabilistic constraints. It is worth noting that the gradients of this family of constraints can be incorporated into a variety of policy-based algorithms. We demonstrate empirically that probabilistic constraints can be handled in a continuous navigation problem.
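To make the distinction concrete, probabilistic safety can be phrased as a joint constraint on the entire trajectory, in contrast with a constraint on an expected cumulative cost. The following is a minimal sketch in standard notation (ours, not necessarily the paper's), where $\mathcal{S}$ denotes the safe set, $\delta$ a tolerance, $c$ a per-step cost, and $C$ a budget:

\[
\underbrace{\mathbb{P}_{\pi}\!\left(s_t \in \mathcal{S},\ \forall\, t \in \{0,\dots,T\}\right) \ge 1-\delta}_{\text{probabilistic safety}}
\qquad\text{vs.}\qquad
\underbrace{\mathbb{E}_{\pi}\!\left[\textstyle\sum_{t=0}^{T} c(s_t,a_t)\right] \le C}_{\text{cumulative constraint}}
\]

The left-hand constraint bounds the probability that the trajectory ever leaves $\mathcal{S}$, whereas the right-hand constraint only controls a cost on average and may therefore tolerate rare but severe violations.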