Self-modification by agents embedded in complex environments is hard to avoid, whether it happens directly (e.g. modification of the agent's own code) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals. Everitt et al. (2016) formally show that providing an option to self-modify is harmless for perfectly rational agents. We show that this result no longer holds for agents with bounded rationality. For such agents, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of the imperfections in the agent's rationality (cases 1-4 below). We also discuss model assumptions and the wider problem and framing space. We examine four ways in which an agent can be bounded-rational: it (1) does not always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discount factor. We show that while in cases (2)-(4) the misalignment caused by the agent's imperfection does not grow over time, in case (1) the misalignment may grow exponentially.
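A minimal toy sketch (our own illustration, not the paper's formal model) of why case (1) compounds while the others stay bounded: if a policy-imperfect agent errs with some small probability eps at each step, and one available error is an irreversible self-modification of its utility function, then the probability of still being aligned after t steps is (1 - eps)^t, which decays exponentially in t. The simulation below (the function name fraction_aligned and all parameter values are hypothetical choices for this sketch) checks that intuition against the analytic value.

    import random

    def fraction_aligned(eps, horizon, trials=10000, seed=0):
        # Toy assumption: each step the agent errs with probability eps,
        # and an error here means an irreversible self-modification of
        # its goal. Returns the fraction of runs still aligned per step.
        rng = random.Random(seed)
        still_aligned = [0] * horizon
        for _ in range(trials):
            aligned = True
            for t in range(horizon):
                if aligned and rng.random() < eps:
                    aligned = False  # a mistaken self-modification sticks
                still_aligned[t] += aligned  # bool counts as 0/1
        return [n / trials for n in still_aligned]

    for eps in (0.01, 0.05):
        sim = fraction_aligned(eps, horizon=100)[-1]
        # Analytic prediction: (1 - eps)^t, i.e. exponential decay in t.
        print(f"eps={eps}: simulated {sim:.3f}, predicted {(1 - eps) ** 100:.3f}")

Under the same toy assumptions, imperfections of types (2)-(4) add a per-step error that does not feed back into the agent's future goals, so their cost stays bounded rather than compounding over time.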