Learn Zero-Constraint-Violation Policy in Model-Free Constrained Reinforcement Learning

In the trial-and-error mechanism of reinforcement learning (RL), a notorious contradiction arises when we expect to learn a safe policy: how can a safe policy be learned without sufficient data or a prior model of the dangerous region? Existing methods mostly apply a posterior penalty to dangerous actions, which means that the agent is not penalized until it experiences danger. As a consequence, the agent cannot learn a zero-violation policy even after convergence: if it stopped violating constraints, it would receive no penalty and lose its knowledge about danger. In this paper, we propose the safe set actor-critic (SSAC) algorithm, which confines the policy update using safety-oriented energy functions, also called safety indexes. The safety index is designed to increase rapidly for potentially dangerous actions, which allows us to locate the safe set in the action space, called the control safe set. We can therefore identify dangerous actions before taking them and obtain a zero-constraint-violation policy after convergence. We claim that the energy function can be learned in a model-free manner, similar to learning a value function. By using the energy function transition as the constraint objective, we formulate a constrained RL problem. We prove that, under some assumptions, our Lagrangian-based solutions ensure that the learned policy converges to the constrained optimum. The proposed algorithm is evaluated in complex simulation environments and in a hardware-in-loop (HIL) experiment with a real autonomous-vehicle controller. Experimental results suggest that the converged policy achieves zero constraint violation in all environments, with performance comparable to model-based baselines.
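
The idea of a control safe set can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the 1-D dynamics, the linear form of the safety index, the constants `D_MIN`, `K`, `ETA`, and `DT`, and the particular form of the energy-transition constraint. The sketch also evaluates the next state with known toy dynamics, whereas the abstract describes learning the energy function in a model-free manner.

```python
# Toy illustration only: all constants and functions below are hypothetical.
D_MIN = 1.0  # assumed minimum safe distance to an obstacle
K = 0.5      # assumed weight on closing speed in the safety index
ETA = 0.1    # assumed required decrease of the index when it is positive
DT = 0.1     # time step of the toy dynamics


def safety_index(state):
    """Toy energy function: larger when the agent is close and approaching fast."""
    distance, closing_speed = state
    return D_MIN - distance + K * closing_speed


def step(state, accel):
    """Toy 1-D dynamics: accel > 0 increases the speed towards the obstacle."""
    distance, closing_speed = state
    new_speed = closing_speed + accel * DT
    new_distance = distance - new_speed * DT
    return (new_distance, new_speed)


def constraint_value(state, accel):
    """Assumed energy-transition constraint; a value <= 0 counts as safe."""
    phi_now = safety_index(state)
    phi_next = safety_index(step(state, accel))
    return phi_next - max(phi_now - ETA, 0.0)


def safe_actions(state, candidates):
    """Return the candidate actions that lie in the toy control safe set."""
    return [a for a in candidates if constraint_value(state, a) <= 0.0]


if __name__ == "__main__":
    risky_state = (1.2, 1.0)  # close to the obstacle and still approaching
    print(safe_actions(risky_state, [-5.0, -4.0, -3.0, 0.0, 3.0]))
```

In this toy setting only sufficiently strong braking actions pass the check in the risky state, which mirrors the point of the abstract: dangerous actions are filtered out before they are taken instead of being penalized afterwards.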
