Safe reinforcement learning (RL) aims to learn policies that satisfy certain constraints before being deployed to safety-critical applications. The primal-dual framework, a prevalent approach to constrained optimization, suffers from instability issues and lacks optimality guarantees. This paper overcomes these issues from a novel probabilistic inference perspective and proposes an Expectation-Maximization style approach to learn a safe policy. We show that the safe RL problem can be decomposed into 1) a convex optimization phase with a non-parametric variational distribution and 2) a supervised learning phase. We demonstrate the unique advantages of constrained variational policy optimization by proving its optimality and policy improvement stability. A wide range of experiments on continuous robotic tasks shows that the proposed method achieves significantly better constraint satisfaction and sample efficiency than primal-dual baselines.
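A minimal sketch of the two-phase decomposition described above, written in illustrative notation that is an assumption rather than the paper's exact formulation: $q$ denotes the non-parametric variational distribution, $Q_r$ and $Q_c$ hypothetical reward and cost critics, $d$ a state distribution, $\epsilon$ a cost limit, $\kappa$ a trust-region radius, and $\pi_\theta$ the parametric policy.

% E-step (convex optimization over the non-parametric variational distribution q):
% maximize expected reward subject to a cost constraint and a KL trust region.
\begin{align}
  \max_{q} \;\; & \mathbb{E}_{s \sim d,\, a \sim q(\cdot \mid s)}\!\left[ Q_r(s, a) \right] \\
  \text{s.t.} \;\; & \mathbb{E}_{s \sim d,\, a \sim q(\cdot \mid s)}\!\left[ Q_c(s, a) \right] \le \epsilon, \\
  & \mathbb{E}_{s \sim d}\!\left[ \mathrm{KL}\!\left( q(\cdot \mid s) \,\|\, \pi_{\theta_{\text{old}}}(\cdot \mid s) \right) \right] \le \kappa .
\end{align}
% M-step (supervised learning): project q back onto the parametric policy class.
\begin{equation}
  \min_{\theta} \;\; \mathbb{E}_{s \sim d}\!\left[ \mathrm{KL}\!\left( q(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right].
\end{equation}

Because the E-step optimizes over distributions rather than policy parameters, it is a convex problem that can be solved reliably, while the M-step reduces to a standard supervised projection onto the parametric policy class.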