Safe exploration is key to applying reinforcement learning (RL) in safety-critical systems. Existing safe exploration methods guarantee safety under regularity assumptions, and it has been difficult to apply them to large-scale real problems. We propose a novel algorithm, SPO-LF, that optimizes an agent's policy while learning the relation between locally available features obtained by sensors and environmental reward/safety using generalized linear function approximations. We provide theoretical guarantees on its safety and optimality. We experimentally show that our algorithm is 1) more efficient in terms of sample complexity and computational cost and 2) more applicable to large-scale problems than previous safe RL methods with theoretical guarantees, and 3) comparably sample-efficient and safer than existing advanced deep RL methods with safety constraints.
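To make the idea of a generalized linear function approximation over locally sensed features concrete, the following minimal sketch fits two such models: an identity-link model for reward and a sigmoid-link model for safety probability. This is a hedged illustration only, not the SPO-LF algorithm itself; the feature dimension, synthetic data, and link functions are assumptions made for the example.

```python
import numpy as np

# Hypothetical illustration (not the authors' SPO-LF implementation):
# fit generalized linear models mapping a locally observed feature vector
# phi(s) to estimates of reward (identity link) and safety (sigmoid link).

rng = np.random.default_rng(0)

d = 4                                       # feature dimension (assumed)
n = 500                                     # number of observed samples (assumed)
Phi = rng.normal(size=(n, d))               # local features from sensors
theta_r_true = rng.normal(size=d)
theta_g_true = rng.normal(size=d)
rewards = Phi @ theta_r_true + 0.1 * rng.normal(size=n)
safety = (rng.random(n) < 1 / (1 + np.exp(-Phi @ theta_g_true))).astype(float)

# Reward model: ordinary least squares (identity link).
theta_r = np.linalg.lstsq(Phi, rewards, rcond=None)[0]

# Safety model: logistic GLM fitted by gradient ascent on the log-likelihood.
theta_g = np.zeros(d)
for _ in range(2000):
    p = 1 / (1 + np.exp(-Phi @ theta_g))
    theta_g += 0.01 * Phi.T @ (safety - p) / n

# Predicted reward and safety probability for a newly sensed feature vector.
phi_new = rng.normal(size=d)
print("reward estimate :", phi_new @ theta_r)
print("safety estimate :", 1 / (1 + np.exp(-phi_new @ theta_g)))
```

In a safe-exploration setting, estimates of this kind would be combined with confidence bounds so that the policy only visits states whose predicted safety is sufficiently high; the details of how SPO-LF does this are given in the paper body.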