The relationship between safety and optimality in control is not well understood, and they are often seen as important yet conflicting objectives. There is a pressing need to formalize this relationship, especially given the growing prominence of learning-based methods. Indeed, it is common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine this relationship, and formalize the requirements for safe value functions: value functions that are both optimal for a given task and enforce safety. We reveal the structure of this relationship through a proof of strong duality, showing that there always exists a finite penalty that induces a safe value function. This penalty is not unique; rather, the set of inducing penalties is unbounded above: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal a clear structure in how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics for designing reward functions for control problems where safety is important.
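As an illustration of the reward-modification practice referred to above, one can write a penalized objective of the following form. This is a minimal sketch in illustrative notation, not the paper's own formulation: $r$ denotes the task reward, $\gamma$ the discount factor, $\mathcal{X}_{\mathrm{fail}}$ the failure set, and $\zeta$ the penalty added to failures.
\[
V_{\zeta}^{*}(x_{0}) \;=\; \max_{\pi}\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\bigl(r(x_{t},u_{t}) \;-\; \zeta\,\mathbf{1}\{x_{t} \in \mathcal{X}_{\mathrm{fail}}\}\bigr)\right]
\]
In this notation, the existence result sketched in the abstract reads: there is a finite threshold $\bar{\zeta}$ such that for every $\zeta \ge \bar{\zeta}$, the resulting optimal value function $V_{\zeta}^{*}$ is a safe value function, i.e., its optimal policies avoid $\mathcal{X}_{\mathrm{fail}}$ while remaining optimal for the original task.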