Safety in reinforcement learning has become increasingly important in recent years. Yet, existing solutions either fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems, or fail to provide regret guarantees for settings in which safety constraints must be learned. In this paper, we address both problems by first modeling safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold. We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation. We show that SLUCB-QVI and RSLUCB-QVI, while incurring \emph{no safety violation}, achieve a $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^3H^3T}\right)$ regret, nearly matching that of state-of-the-art unsafe algorithms, where $H$ is the duration of each episode, $d$ is the dimension of the feature mapping, $\kappa$ is a constant characterizing the safety constraints, and $T$ is the total number of action plays. We further present numerical simulations that corroborate our theoretical findings.
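As a brief sketch of the safety model described above (the notation $\phi$, $\gamma^\star$, and $\tau$ is illustrative and not necessarily the paper's own): the per-step safety cost is assumed linear in a known feature map with an unknown parameter, and every state-action pair the agent plays must keep this cost below the threshold,
\[
  c(s,a) \;=\; \big\langle \phi(s,a),\, \gamma^\star \big\rangle \;\le\; \tau
  \qquad \text{for all played pairs } (s,a),
\]
where $\gamma^\star$ must be learned from observed costs while the constraint is enforced at every step.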