Safe reinforcement learning is extremely challenging: not only must the agent explore an unknown environment, it must also do so while ensuring that no safety constraints are violated. We formulate this safe reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function, where we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call Doubly Optimistic and Pessimistic Exploration (DOPE), and show that it achieves an objective regret of $\tilde{O}(|\mathcal{S}|\sqrt{|\mathcal{A}| K})$ without violating the safety constraints during learning, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, and $K$ is the number of learning episodes. Our key idea is to combine a reward bonus for exploration (optimism) with a conservative constraint (pessimism), in addition to the standard optimistic model-based exploration. DOPE not only improves the objective regret bound, but also shows a significant empirical performance improvement compared to earlier optimism-pessimism approaches.
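To make the optimism-pessimism idea concrete, the following is a minimal schematic of the per-episode planning problem; the notation ($\widehat{V}$ for values under the empirical model, $b_k$ for an exploration bonus, $\bar{C}$ for the cost budget, $s_1$ for the initial state) is illustrative and not necessarily the paper's exact formulation:
\[
\pi_k \in \arg\max_{\pi} \; \widehat{V}^{\pi}_{r + b_k}(s_1)
\quad \text{subject to} \quad
\widehat{V}^{\pi}_{c + b_k}(s_1) \le \bar{C},
\]
where the bonus $b_k$ is added to the reward (optimism, encouraging exploration) and also added to the cost (pessimism, tightening the constraint), so that, with high probability, any policy feasible for this surrogate problem remains feasible under the true model.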