Improving sample efficiency and ensuring safety are crucial challenges when deploying reinforcement learning in high-stakes real-world applications. We propose LAMBDA, a novel model-based approach for policy optimization in safety-critical tasks modeled as constrained Markov decision processes. Our approach uses Bayesian world models and harnesses the resulting uncertainty to maximize optimistic upper bounds on the task objective, as well as pessimistic upper bounds on the safety constraints. We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
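The role of the two upper bounds can be illustrated with a minimal sketch. It assumes that a policy has already been evaluated under several world models sampled from the Bayesian posterior, giving one return estimate and one cost estimate per sample. The function names, the `cost_budget` and `multiplier` parameters, and the simple Lagrangian-style penalty are hypothetical choices for illustration; the abstract does not specify how LAMBDA computes or combines these bounds.

```python
from typing import Sequence, Tuple


def optimistic_pessimistic_bounds(
    returns: Sequence[float], costs: Sequence[float]
) -> Tuple[float, float]:
    """Upper bounds over posterior samples (illustrative assumption).

    returns[i] and costs[i] are the policy's estimated return and
    constraint cost under the i-th sampled world model.
    """
    if len(returns) == 0 or len(returns) != len(costs):
        raise ValueError("need one return and one cost per sampled model")
    # Optimism: assume the best plausible return for the task objective.
    upper_return = max(returns)
    # Pessimism: assume the worst plausible cost for the safety constraint.
    upper_cost = max(costs)
    return upper_return, upper_cost


def penalized_objective(
    returns: Sequence[float],
    costs: Sequence[float],
    cost_budget: float,
    multiplier: float,
) -> float:
    """Hypothetical penalized objective built from the two bounds.

    The optimistic return is reduced in proportion to how far the
    pessimistic cost exceeds the allowed budget.
    """
    upper_return, upper_cost = optimistic_pessimistic_bounds(returns, costs)
    violation = max(0.0, upper_cost - cost_budget)
    return upper_return - multiplier * violation


if __name__ == "__main__":
    sampled_returns = [10.0, 12.0, 11.0]
    sampled_costs = [20.0, 30.0, 25.0]
    print(penalized_objective(sampled_returns, sampled_costs, 25.0, 0.5))
```

Taking the maximum over samples for both quantities is what makes the objective estimate optimistic and the constraint estimate pessimistic: a policy is only treated as safe if it satisfies the budget under the least favorable sampled model.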