In many applications of Reinforcement Learning (RL), it is critically important that the algorithm performs safely, such that instantaneous hard constraints are satisfied at each step, and unsafe states and actions are avoided. However, existing algorithms for ``safe'' RL are often designed either under constraints that only require the expected cumulative cost to be bounded or under the assumption that all states are safe. Thus, such algorithms could violate instantaneous hard constraints and traverse unsafe states (and actions) in practice. Therefore, in this paper, we develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions under instantaneous hard constraints and the linear mixture model. It not only achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly matches the state-of-the-art regret in the setting with only unsafe actions and nearly matches that in the unconstrained setting, but is also safe at each step, where $d$ is the feature-mapping dimension, $K$ is the number of episodes, $H$ is the number of steps in each episode, and $\Delta_c$ is a safety-related parameter. We also provide a lower bound $\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependency on $\Delta_c$ is necessary. Further, both our algorithm design and regret analysis involve several novel ideas, which may be of independent interest.