Safe exploration is a common problem in reinforcement learning (RL) that aims to prevent agents from making disastrous decisions while exploring their environment. A family of approaches to this problem assumes domain knowledge in the form of a (partial) model of the environment to decide whether an action is safe. A so-called shield forces the RL agent to select only safe actions. However, for adoption in various applications, one must look beyond enforcing safety and also ensure that RL remains applicable and performs well. We extend the applicability of shields via tight integration with state-of-the-art deep RL, and provide an extensive, empirical study in challenging, sparse-reward environments under partial observability. We show that a carefully integrated shield ensures safety and can improve the convergence rate and final performance of RL agents. We furthermore show that a shield can be used to bootstrap state-of-the-art RL agents: they remain safe after initial learning in a shielded setting, allowing us to eventually disable a potentially overly conservative shield.
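To make the shielding idea concrete, the following is a minimal sketch (not the paper's implementation): a shield that uses a partial model of the environment, represented here by a hypothetical `is_safe` predicate, to restrict the agent's choices to actions labeled safe. The environment, policy, and fallback behavior are illustrative placeholders only.

```python
# Minimal illustrative sketch of action shielding; `is_safe`, the policy, and
# the environment interface are hypothetical and not taken from the paper.
import random
from typing import Callable, List


class Shield:
    """Restricts action selection to actions the (partial) model labels safe."""

    def __init__(self, n_actions: int, is_safe: Callable[[object, int], bool]):
        self.n_actions = n_actions
        self.is_safe = is_safe  # domain knowledge: partial model of the environment

    def safe_actions(self, state) -> List[int]:
        allowed = [a for a in range(self.n_actions) if self.is_safe(state, a)]
        # If the model rules out every action (overly conservative shield),
        # fall back to all actions rather than blocking the agent entirely.
        return allowed if allowed else list(range(self.n_actions))

    def filter(self, state, proposed_action: int) -> int:
        """Return the proposed action if it is safe, otherwise a safe substitute."""
        allowed = self.safe_actions(state)
        return proposed_action if proposed_action in allowed else random.choice(allowed)


def shielded_step(env, policy, shield: Shield, state):
    """Usage: wrap any policy so only shielded actions reach the environment."""
    action = shield.filter(state, policy(state))
    return env.step(action)
```

In this sketch, disabling the shield after an initial shielded learning phase amounts to passing the policy's action through unchanged, which mirrors the bootstrapping setup described above.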