Safety constraints and optimality are important but sometimes conflicting criteria for controllers. Although these criteria are often solved separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions: value functions that are both optimal for a given task and enforce safety constraints. We reveal this structure by examining when rewards preserve viability under optimal control, and show that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but upper-unbounded: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal a clear structure in how the penalty, reward, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics for designing reward functions for control problems where safety is important.
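To make the penalized-reward construction concrete, the following is a minimal sketch in standard notation; the symbols $r$, $\mathcal{X}_{\mathrm{fail}}$, $p$, $\gamma$, and $V_p^*$ are illustrative conventions and are not taken from the abstract itself:
$$
r_p(x, u) \;=\;
\begin{cases}
r(x, u), & x \notin \mathcal{X}_{\mathrm{fail}},\\[2pt]
-\,p, & x \in \mathcal{X}_{\mathrm{fail}},
\end{cases}
\qquad
V_p^*(x) \;=\; \max_{\pi}\; \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t\, r_p(x_t, u_t)\right].
$$
Under this reading, the abstract's existence claim is that there is a finite threshold $\bar{p}$ such that for every $p \ge \bar{p}$, the optimal value function $V_p^*$ is a safe value function: its optimal policies keep the system viable (never reaching $\mathcal{X}_{\mathrm{fail}}$) while remaining optimal for the original task, and increasing $p$ beyond $\bar{p}$ does not degrade optimality.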