Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bounding the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints can be satisfied with high probability during training, derive provable convergence guarantees for our approach, which is no worse asymptotically than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at https://sites.google.com/view/conservative-safety-critics/home
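To make the idea of critic-gated exploration concrete, below is a minimal, hypothetical sketch (not the paper's released code): a conservatively trained safety critic estimates the probability that a candidate action leads to catastrophic failure, and exploratory actions whose estimated risk exceeds a threshold are resampled. The names `sample_action`, `safety_critic`, `epsilon`, and the toy risk model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(state: np.ndarray) -> np.ndarray:
    """Placeholder stochastic policy: Gaussian action around a linear map of the state."""
    return 0.1 * state + rng.normal(scale=0.3, size=state.shape)

def safety_critic(state: np.ndarray, action: np.ndarray) -> float:
    """Placeholder conservative critic: returns an estimated failure probability in [0, 1].

    For illustration only, larger-magnitude actions are treated as riskier.
    """
    return float(np.clip(np.linalg.norm(action) / 3.0, 0.0, 1.0))

def safe_exploration_step(state: np.ndarray, epsilon: float = 0.2,
                          max_resamples: int = 20) -> np.ndarray:
    """Resample actions until the critic's risk estimate falls below epsilon.

    If no sampled action passes the threshold, fall back to the least risky one seen.
    """
    best_action, best_risk = None, float("inf")
    for _ in range(max_resamples):
        action = sample_action(state)
        risk = safety_critic(state, action)
        if risk <= epsilon:
            return action
        if risk < best_risk:
            best_action, best_risk = action, risk
    return best_action

if __name__ == "__main__":
    state = rng.normal(size=4)
    action = safe_exploration_step(state)
    print("action:", action, "estimated risk:", safety_critic(state, action))
```

Under this reading, the threshold epsilon plays the role of the per-iteration bound on the failure probability that the abstract refers to; the paper's actual method additionally trains the critic conservatively so that its risk estimates overestimate, rather than underestimate, the true failure probability.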