We present a general framework for training safe agents whose naive incentives are unsafe. For example, manipulative or deceptive behaviour can improve reward but should be avoided. Most approaches fail here: agents that maximize expected return will pursue it by any means necessary. We formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then use Causal Influence Diagram analysis to train agents that maximize only the causal effect of their actions on the expected return that is not mediated by the delicate parts of the state. The resulting agents have no incentive to control the delicate state. We further show how our framework unifies and generalizes existing proposals.
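To make the objective concrete, here is a minimal Python sketch of the path-specific idea under a toy structural causal model: the delicate state is held at the counterfactual value it would take under a default action, so the agent's objective only credits effects of its action that are not mediated by the delicate state. All names here (`delicate_state`, `reward`, `DEFAULT_ACTION`) and the linear model are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

DEFAULT_ACTION = 0.0  # hypothetical baseline action a0 used to "freeze" the delicate path

def delicate_state(action, noise):
    # D = f(A, noise): e.g. a human belief the agent could manipulate.
    return 0.8 * action + noise

def reward(action, delicate, noise):
    # R depends on A both directly and through D (the unsafe channel).
    return 2.0 * action + 3.0 * delicate + noise

def path_specific_return(action, n_samples=10_000):
    """Monte Carlo estimate of the expected return where the delicate
    state responds to DEFAULT_ACTION rather than the actual action,
    so the agent gets no credit for influence mediated by D."""
    total = 0.0
    for _ in range(n_samples):
        eps_d = rng.normal()
        eps_r = rng.normal()
        d_baseline = delicate_state(DEFAULT_ACTION, eps_d)  # counterfactual D(a0)
        total += reward(action, d_baseline, eps_r)
    return total / n_samples

# Under the full return, larger actions pay off partly via D (manipulation);
# under the path-specific return, only the direct effect of A remains.
for a in (0.0, 1.0):
    print(f"a={a}: path-specific return = {path_specific_return(a):.2f}")
```

An agent trained on `path_specific_return` still benefits from the direct effect of its action on reward, but gains nothing from moving the delicate state, removing the incentive to control it.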