This paper studies constrained Markov decision processes (CMDPs) with constraints against stochastic thresholds, aiming at safe reinforcement learning in unknown and uncertain environments. We leverage a Growing-Window estimator, which samples from interactions with the uncertain environment, to estimate the thresholds, and based on it we design Stochastic Pessimistic-Optimistic Thresholding (SPOT), a novel model-based primal-dual algorithm for multiple constraints against stochastic thresholds. SPOT enables reinforcement learning under both pessimistic and optimistic threshold settings. We prove that our algorithm achieves sublinear regret and constraint violation; i.e., a reward regret of $\tilde{\mathcal{O}}(\sqrt{T})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{T})$ constraint violation over $T$ episodes. The theoretical guarantees show that our algorithm achieves performance comparable to that of an approach relying on fixed and known thresholds. To the best of our knowledge, SPOT is the first reinforcement learning algorithm that realises theoretically guaranteed performance in an uncertain environment where even the thresholds are unknown.
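For intuition, a minimal sketch of how a growing-window threshold estimate can be formed (the symbols $\widehat{b}_{i,k}$, $B_{i,j}$, and $\beta_k$ below are illustrative assumptions, not the paper's formal definitions): at episode $k$, the $i$-th unknown threshold is estimated from all samples collected so far,
$$\widehat{b}_{i,k} \;=\; \frac{1}{k-1}\sum_{j=1}^{k-1} B_{i,j}, \qquad \beta_k \;=\; \sqrt{\frac{\log(2/\delta)}{2(k-1)}},$$
where $B_{i,j}$ denotes the threshold sample observed in episode $j$ and $\beta_k$ is a Hoeffding-type confidence radius. A pessimistic setting would then enforce the constraint against $\widehat{b}_{i,k} - \beta_k$, and an optimistic setting against $\widehat{b}_{i,k} + \beta_k$.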