This paper investigates reinforcement learning with constraints, which are indispensable in safety-critical environments. To drive the constraint violation to decrease monotonically, we treat the constraints as Lyapunov functions and impose new linear constraints on the updating dynamics of the policy parameters. As a result, the original safety set is rendered forward-invariant. However, because the new guaranteed-feasible constraints are imposed on the updating dynamics rather than on the original policy parameters, classic optimization algorithms are no longer applicable. To address this, we propose to learn a generic deep neural network (DNN)-based optimizer that optimizes the objective while satisfying the linear constraints. Constraint satisfaction is achieved via projection onto a polytope formed by the multiple linear inequality constraints, which can be solved analytically with our newly designed metric. To the best of our knowledge, this is the \textit{first} DNN-based optimizer for constrained optimization with a forward-invariance guarantee. We show that our optimizer trains a policy to monotonically decrease the constraint violation and maximize the cumulative reward. Results on numerical constrained optimization and obstacle-avoidance navigation validate the theoretical findings.
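For intuition only, the sketch below illustrates the kind of polytope projection the abstract refers to. It uses a hypothetical helper `project_update` and a plain Euclidean metric rather than the paper's newly designed metric: a candidate parameter update is swept through successive half-space projections for each linear inequality constraint. A single sweep need not land exactly in the intersection for a general polytope (repeating the sweep, as in the method of alternating projections, reaches a feasible point), so this is a minimal illustration, not the paper's analytical projection.

```python
import numpy as np

def project_update(d, A, b):
    """Sketch: push a candidate update d toward the polytope {x : A x <= b}.

    Hypothetical illustration under the Euclidean metric, not the paper's
    analytically solvable metric. Performs one sweep of successive
    half-space projections over the rows of A.
    """
    x = d.copy()
    for a_i, b_i in zip(A, b):
        violation = a_i @ x - b_i
        if violation > 0:
            # Euclidean projection onto the half-space {x : a_i x <= b_i}
            x = x - (violation / (a_i @ a_i)) * a_i
    return x

# Example: clip a proposed update against two linear safety constraints
d = np.array([1.0, 2.0])
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.5, 1.0])
print(project_update(d, A, b))  # -> [0.5, 1.0]
```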