Safety constraints and optimality are important, but sometimes conflicting, criteria for controllers. Although these criteria are often addressed separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions: value functions that are optimal for a given task and that enforce safety constraints. We reveal the structure of this relationship through a proof of strong duality, showing that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but unbounded from above: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal a clear structure in how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics for designing reward functions for control problems where safety is important.
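As a concrete illustration of the penalty mechanism described above, the following sketch (ours, not code from the paper; the toy MDP, the 0.05 failure probability, and the penalty values are assumptions for illustration only) runs value iteration on a small MDP where a "risky" action reaches the goal faster but may enter a failure state. It shows that some finite penalty makes the optimal policy safe, and that increasing the penalty further leaves the induced policy and its original-task value unchanged.

```python
import numpy as np

# Toy MDP (illustrative assumption, not from the paper): a "risky" action at the
# start reaches the goal immediately but fails with probability 0.05; a "safe"
# action takes a detour through an intermediate state. We sweep the failure
# penalty zeta, run value iteration on the penalized reward, and evaluate the
# induced policy on the original (unpenalized) task.

GAMMA = 0.9
START, MID, GOAL, FAIL = range(4)
RISKY, SAFE = 0, 1


def transitions(zeta):
    """P[s][a] = list of (prob, next_state, reward) with failure penalty zeta."""
    return {
        START: {RISKY: [(0.95, GOAL, 1.0), (0.05, FAIL, -zeta)],
                SAFE:  [(1.0, MID, 0.0)]},
        MID:   {a: [(1.0, GOAL, 1.0)] for a in (RISKY, SAFE)},
        GOAL:  {a: [(1.0, GOAL, 0.0)] for a in (RISKY, SAFE)},
        FAIL:  {a: [(1.0, FAIL, 0.0)] for a in (RISKY, SAFE)},
    }


def value_iteration(P, tol=1e-10):
    """Return the optimal value function and a greedy policy for the MDP P."""
    V = np.zeros(len(P))
    while True:
        Q = np.array([[sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                       for a in (RISKY, SAFE)] for s in P])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new


for zeta in [0.0, 0.5, 2.0, 100.0]:
    _, policy = value_iteration(transitions(zeta))
    # Evaluate the induced policy on the *original* (unpenalized) reward.
    P0 = transitions(0.0)
    V_task = np.zeros(4)
    for _ in range(500):
        V_task = np.array([sum(p * (r + GAMMA * V_task[s2])
                               for p, s2, r in P0[s][policy[s]]) for s in P0])
    action = "safe" if policy[START] == SAFE else "risky"
    print(f"zeta={zeta:6.1f}: optimal action at start = {action:5s}, "
          f"task value of start = {V_task[START]:.3f}")
```

In this toy example the minimum required penalty lies between 0.5 and 2: below it the greedy policy takes the risky action, while for every penalty above it the policy is safe and its task value stays constant, mirroring the claim that larger penalties do not harm optimality.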