Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety-critical applications, as it improves the behavior policy with high probability. We classify various SPI approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also check the safety guarantees of the provably safe algorithms and show that huge amounts of data are necessary for the safety bounds to become useful in practice.
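For context, the safety guarantee referred to above is commonly formalized in the SPI/SPIBB literature as a $(\zeta, \delta)$-approximate policy improvement criterion; the notation below ($\rho$ for the expected return, $M^*$ for the true MDP, $\pi_b$ for the behavior policy) is a standard convention assumed here rather than defined in this abstract:

% Sketch of the standard SPI safety criterion (assumed notation):
% the learned policy \pi must, with probability at least 1 - \delta over the dataset,
% perform no worse than the behavior policy \pi_b minus an admissible loss \zeta.
\[
\Pr\Big( \rho(\pi, M^*) \;\geq\; \rho(\pi_b, M^*) - \zeta \Big) \;\geq\; 1 - \delta
\]

The experiments on the safety bounds mentioned at the end of the abstract concern how much data is needed before $\zeta$ becomes small enough for this guarantee to be informative in practice.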