A key challenge in applying reinforcement learning to safety-critical domains is understanding how to balance exploration (needed to attain good performance on the task) with safety (needed to avoid catastrophic failure). Although a growing line of work in reinforcement learning has investigated this area of "safe exploration," most existing techniques either 1) do not guarantee safety during the actual exploration process and/or 2) limit the problem to a priori known and/or deterministic transition dynamics with strong smoothness assumptions. Addressing this gap, we propose Analogous Safe-state Exploration (ASE), an algorithm for provably safe exploration in MDPs with unknown, stochastic dynamics. Our method exploits analogies between state-action pairs to safely learn a near-optimal policy in a PAC-MDP sense. Additionally, ASE guides exploration towards the most task-relevant states, which empirically yields significant improvements in sample efficiency compared to existing methods.