Power-seeking behavior is a key source of risk from advanced AI, but our theoretical understanding of this phenomenon is relatively limited. Building on existing theoretical results demonstrating power-seeking incentives for most reward functions, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some simplifying assumptions. We formally define the training-compatible goal set (the set of goals consistent with the training rewards) and assume that the trained agent learns a goal from this set. In a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, we prove that the agent is likely to avoid shutdown. Thus, we show that power-seeking incentives can be probable (likely to arise for trained agents) and predictive (allowing us to predict undesirable behavior in new situations).
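To make the central definition concrete, the following is a minimal formalization consistent with the description above; the names $D$ and $\mathcal{G}_D$ are illustrative notation rather than taken from the paper, and rewards are assumed to be deterministic functions of state-action pairs. Given data $D = \{(s_i, a_i, r_i)\}_{i=1}^{n}$ of state-action-reward triples observed during training, the training-compatible goal set collects every reward function that reproduces the observed rewards:

\[
  % Hedged sketch: D and \mathcal{G}_D are illustrative names, assuming
  % rewards are deterministic functions of state-action pairs.
  \mathcal{G}_D \;=\; \bigl\{\, R : \mathcal{S} \times \mathcal{A} \to \mathbb{R} \;\bigm|\; R(s_i, a_i) = r_i \ \text{for all } (s_i, a_i, r_i) \in D \,\bigr\}.
\]

On this reading, the shutdown claim amounts to the observation that a newly encountered shutdown choice was not visited during training, so goals in $\mathcal{G}_D$ are unconstrained there, and for most such goals staying operational preserves more attainable reward than shutting down.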