In this paper, we study the learning of safe policies in the setting of reinforcement learning problems. That is, we aim to control a Markov Decision Process (MDP) whose transition probabilities are unknown, but for which we have access to sample trajectories gathered through experience. We define safety as the agent remaining within a desired safe set with high probability throughout the operation time. We therefore consider a constrained MDP in which the constraints are probabilistic. Since there is no straightforward way to optimize the policy with respect to the probabilistic constraint in a reinforcement learning framework, we propose an ergodic relaxation of the problem. The advantages of the proposed relaxation are threefold. (i) The safety guarantees are maintained for episodic tasks and hold up to a given time horizon for continuing tasks. (ii) Despite its non-convexity, the constrained optimization problem has an arbitrarily small duality gap if the parametrization of the policy is rich enough. (iii) The gradients of the Lagrangian associated with the safe-learning problem can be computed easily using standard policy gradient results and stochastic approximation tools. Leveraging these advantages, we establish that primal-dual algorithms are able to find policies that are both safe and optimal. We test the proposed approach in a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and to the required safety levels.
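As a hedged illustration of the formulation summarized above (the notation, the discount factor, and the exact constraint level below are our own choices and need not match those used in the body of the paper), consider a parametrized policy $\pi_\theta$, a safe set $\mathcal{S}$, and a tolerance $\delta \in (0,1)$. The reward objective and the probabilistic safety requirement read
\[
\max_{\theta}\; V(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T}\gamma^{t} r(s_t,a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{P}_{\pi_\theta}\big(s_t \in \mathcal{S} \text{ for all } t \le T\big) \ge 1-\delta .
\]
Since $\mathbb{E}[\mathbb{1}\{s_t \in \mathcal{S}\}] = \mathbb{P}(s_t \in \mathcal{S}) \ge \mathbb{P}(s_t \in \mathcal{S} \text{ for all } t \le T)$ for every $t \le T$, the probabilistic constraint implies the ergodic-type surrogate
\[
U(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T}\gamma^{t}\,\mathbb{1}\{s_t \in \mathcal{S}\}\Big] \ge (1-\delta)\sum_{t=0}^{T}\gamma^{t},
\]
so replacing the former by the latter is indeed a relaxation. The associated Lagrangian and a generic primal-dual iteration (with step sizes $\eta_\theta,\eta_\lambda > 0$ and $[\cdot]_+$ denoting projection onto the nonnegative reals) take the form
\[
\mathcal{L}(\theta,\lambda) = V(\theta) + \lambda\Big(U(\theta) - (1-\delta)\sum_{t=0}^{T}\gamma^{t}\Big),
\qquad
\theta_{k+1} = \theta_k + \eta_\theta \nabla_\theta \mathcal{L}(\theta_k,\lambda_k),
\qquad
\lambda_{k+1} = \Big[\lambda_k - \eta_\lambda\Big(U(\theta_k) - (1-\delta)\sum_{t=0}^{T}\gamma^{t}\Big)\Big]_{+}.
\]
Both $\nabla_\theta V$ and $\nabla_\theta U$ can be estimated from sampled trajectories with standard policy gradient estimators, since $U$ has the same cumulative-reward structure as $V$ with the indicator $\mathbb{1}\{s_t \in \mathcal{S}\}$ playing the role of the reward.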