ProSh: Probabilistic Shielding for Model-free Reinforcement Learning

Safety is a major concern in reinforcement learning (RL): we aim to develop RL systems that not only perform optimally but are also safe to deploy, with formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the state space of the Constrained MDP with a risk budget and enforces safety by applying a shield to the agent's policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper bound on the expected cost, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
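
To make the idea of shielding a policy distribution "in expectation" concrete, the following is a minimal sketch and not the ProSh algorithm itself. It assumes a discrete action set, a hypothetical list `cost_q` holding a cost critic's estimate for each action in the current state, and a hypothetical scalar `budget` standing in for the risk budget carried in the augmented state. The mixing rule used here (shifting probability mass toward the lowest-cost action until the expected estimated cost meets the budget) is an illustrative choice, and the name `shield_distribution` is invented for this example.

```python
from typing import List


def shield_distribution(probs: List[float], cost_q: List[float], budget: float) -> List[float]:
    """Illustrative shield (an assumption, not the paper's construction).

    Returns a distribution whose expected estimated cost is at most `budget`
    whenever the lowest-cost action itself fits within the budget.
    """
    # Expected cost of the unshielded policy according to the critic.
    expected = sum(p * c for p, c in zip(probs, cost_q))
    if expected <= budget:
        return list(probs)  # already safe in expectation, leave untouched

    # Index and estimated cost of the safest action under the critic.
    safest = min(range(len(cost_q)), key=cost_q.__getitem__)
    c_min = cost_q[safest]

    if c_min >= budget:
        # The budget cannot be met (or is met only by the safest action):
        # fall back to playing the safest action deterministically.
        fallback = [0.0] * len(probs)
        fallback[safest] = 1.0
        return fallback

    # Mix with the safest action: (1 - lam) * expected + lam * c_min == budget.
    lam = (expected - budget) / (expected - c_min)
    shielded = [(1.0 - lam) * p for p in probs]
    shielded[safest] += lam
    return shielded


if __name__ == "__main__":
    # Two actions: a cheap one (cost 0.0) and a risky one (cost 1.0).
    print(shield_distribution([0.5, 0.5], [0.0, 1.0], budget=0.25))
```

In this sketch the guarantee is only as good as the critic: if `cost_q` underestimates the true cost, the shielded distribution can still exceed the budget, which mirrors the abstract's point that safety during training depends on critic accuracy.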
