Safety constraints and optimality are important, but sometimes conflicting criteria for controllers. Although these criteria are often solved separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions (SVFs): value functions that are both optimal for a given task, and enforce safety constraints. We reveal this structure by examining when rewards preserve viability under optimal control, and show that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but upper-unbounded: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal clear structure of how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics to design reward functions for control problems where safety is important.
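To make the core claim concrete, here is a minimal, illustrative sketch (not the paper's construction): a toy deterministic MDP in which the shortest route to the goal crosses an unsafe state. Running value iteration with increasing failure penalties shows that a finite penalty suffices to make the optimal policy avoid the unsafe state, and that raising the penalty further leaves the optimal value of the safe states unchanged. All states, rewards, the discount factor, and the penalty values below are assumed for illustration only.

```python
# Toy example: a sufficiently large (but finite) failure penalty induces a safe
# optimal policy, and larger penalties do not harm optimality on safe states.

GAMMA = 0.9                                   # discount factor (illustrative)
STATES = ["start", "unsafe", "d1", "d2", "goal"]
ACTIONS = ["short", "long"]                   # "short" cuts through the unsafe state

def build_mdp(penalty):
    """Deterministic transitions and rewards for the toy problem.

    The "short" route from start crosses the unsafe state and reaches the goal
    in two steps; the "long" route detours through d1 and d2 and takes three.
    Entering the unsafe state incurs the failure penalty.
    """
    nxt = {
        ("start", "short"): "unsafe", ("start", "long"): "d1",
        ("unsafe", "short"): "goal",  ("unsafe", "long"): "goal",
        ("d1", "short"): "d2",        ("d1", "long"): "d2",
        ("d2", "short"): "goal",      ("d2", "long"): "goal",
        ("goal", "short"): "goal",    ("goal", "long"): "goal",
    }
    rew = {key: 0.0 for key in nxt}
    rew[("start", "short")] = -penalty        # penalty on entering the unsafe state
    for a in ACTIONS:                         # +1 for reaching the goal
        rew[("unsafe", a)] = 1.0
        rew[("d2", a)] = 1.0
    return nxt, rew

def value_iteration(penalty, iters=200):
    """Return the optimal value function and greedy policy for a given penalty."""
    nxt, rew = build_mdp(penalty)
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: max(rew[(s, a)] + GAMMA * V[nxt[(s, a)]] for a in ACTIONS)
             for s in STATES}
    greedy = {s: max(ACTIONS, key=lambda a, s=s: rew[(s, a)] + GAMMA * V[nxt[(s, a)]])
              for s in STATES}
    return V, greedy

for p in [0.0, 0.05, 0.2, 100.0]:
    V, pi = value_iteration(p)
    print(f"penalty={p:6.2f}  V(start)={V['start']:+.3f}  action at start: {pi['start']}")
```

In this sketch, any penalty above a small finite threshold (here about 0.09) flips the optimal action at the start state from the unsafe shortcut to the safe detour, after which V(start) stays at the safe-path value no matter how much larger the penalty is made. This mirrors the abstract's point that the sufficient penalty is not unique but upper-unbounded, while computing the exact minimum threshold generally requires knowledge of the rewards, discount factor, and dynamics.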