Safe Exploration for Constrained Reinforcement Learning with Provable Guarantees

We consider the problem of learning an episodic safe control policy that minimizes an objective function, while satisfying necessary safety constraints -- both during learning and deployment. We formulate this safety constrained reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function. Here, we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm, and show that it achieves an $\tilde{\mathcal{O}}(S^{2}\sqrt{A H^{7}K}/ (\bar{C} - \bar{C}_{b}))$ cumulative regret without violating the safety constraints during learning, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon length, $K$ is the number of learning episodes, and $(\bar{C} - \bar{C}_{b})$ is the safety gap, i.e., the difference between the constraint value and the cost of a known safe baseline policy. The scaling as $\tilde{\mathcal{O}}(\sqrt{K})$ is the same as the traditional approach where constraints may be violated during learning, which means that our algorithm suffers no additional regret in spite of providing a safety guarantee. Our key idea is to use an optimistic exploration approach with pessimistic constraint enforcement for learning the policy. This approach simultaneously incentivizes the exploration of unknown states while imposing a penalty for visiting states that are likely to cause violation of safety constraints. We validate our algorithm by evaluating its performance on benchmark problems against conventional approaches.
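The idea of optimistic exploration with pessimistic constraint enforcement can be illustrated with a small tabular sketch. This is a simplified illustration under stated assumptions, not the OPSRL algorithm itself: the count-based bonus formula, the function names (`exploration_bonus`, `shaped_costs`, `expected_cumulative_cost`, `choose_policy`), and the rule of falling back to the baseline policy whenever a candidate fails the pessimistic check are all hypothetical choices made here for clarity. A full method would instead plan over the shaped costs, for example by solving a constrained optimization problem.

```python
import math


def exploration_bonus(count, num_states, num_actions, horizon, delta=0.05):
    """Count-based confidence width (assumed form); shrinks as visits grow."""
    n = max(count, 1)  # avoid division by zero for unvisited pairs
    return math.sqrt(math.log(num_states * num_actions * horizon / delta) / n)


def shaped_costs(objective_cost, constraint_cost, count,
                 num_states, num_actions, horizon):
    """Return (optimistic objective cost, pessimistic constraint cost) for one (s, a)."""
    b = horizon * exploration_bonus(count, num_states, num_actions, horizon)
    optimistic = objective_cost - b    # rarely visited pairs look cheaper: explore
    pessimistic = constraint_cost + b  # rarely visited pairs look riskier: stay safe
    return optimistic, pessimistic


def expected_cumulative_cost(policy, transition, cost, horizon, start_state):
    """Finite-horizon policy evaluation by backward induction.

    policy[h][s]      -> action taken at step h in state s
    transition[s][a]  -> list of next-state probabilities
    cost[s][a]        -> per-step cost
    """
    num_states = len(transition)
    value = [0.0] * num_states  # value after the final step is zero
    for h in reversed(range(horizon)):
        new_value = []
        for s in range(num_states):
            a = policy[h][s]
            expected_next = sum(p * v for p, v in zip(transition[s][a], value))
            new_value.append(cost[s][a] + expected_next)
        value = new_value
    return value[start_state]


def choose_policy(candidate, baseline, transition_estimate, pessimistic_cost,
                  horizon, start_state, constraint_value):
    """Use the candidate only if its pessimistic cost estimate meets the constraint.

    Falling back to the known safe baseline is this sketch's assumption.
    """
    estimate = expected_cumulative_cost(
        candidate, transition_estimate, pessimistic_cost, horizon, start_state)
    return candidate if estimate <= constraint_value else baseline
```

As the visit count of a state-action pair grows, the bonus shrinks, so the shaped objective and constraint costs both approach their empirical values. Early on, the bonus makes unvisited pairs attractive for the objective but expensive for the constraint, which is the tension the abstract describes.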
