If capable AI agents are generally incentivized to seek power in service of the objectives we specify for them, then these systems will pose enormous risks, in addition to enormous benefits. In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive. However, the real world is neither fully observable, nor will agents be perfectly optimal. We consider a range of models of AI decision-making, from optimal, to random, to choices informed by learning and interacting with an environment. We discover that many decision-making functions are retargetable, and that retargetability is sufficient to cause power-seeking tendencies. Our functional criterion is simple and broad. We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power. We demonstrate the flexibility of our results by reasoning about learned policy incentives in Montezuma's Revenge. These results suggest a safety risk: Eventually, highly retargetable training procedures may train real-world agents which seek power over humans.
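To make the retargetability claim concrete, here is a minimal toy sketch (not from the paper; the outcomes, action names, and reward values are hypothetical). It illustrates the orbit-level argument: when one action preserves more reachable outcomes than another, most permutations of a fixed reward vector over the outcomes make the option-preserving action optimal.

```python
# Toy illustration (assumed setup, not the paper's formalism): a one-step choice where
# "shutdown" reaches a single outcome while "stay_alive" keeps two outcomes reachable.
# Sweeping over all retargetings (permutations) of one reward vector shows that most
# of them favor the option-preserving action.
from itertools import permutations

OUTCOMES = ["off", "room_a", "room_b"]            # hypothetical terminal outcomes
REACHABLE = {
    "shutdown": {"off"},                          # shutdown reaches only one outcome
    "stay_alive": {"room_a", "room_b"},           # staying alive keeps two options open
}

def optimal_action(reward):
    """Pick the action whose best reachable outcome has the highest reward."""
    return max(REACHABLE, key=lambda a: max(reward[o] for o in REACHABLE[a]))

# The "orbit" of one reward function: every assignment of the same three values
# to the three outcomes. Count which action each retargeted reward favors.
counts = {"shutdown": 0, "stay_alive": 0}
for perm in permutations([1.0, 2.0, 3.0]):
    reward = dict(zip(OUTCOMES, perm))
    counts[optimal_action(reward)] += 1

print(counts)  # {'shutdown': 2, 'stay_alive': 4}: most retargetings favor keeping options
```

In this sketch, two thirds of the reward retargetings make the option-preserving action optimal, because only assignments that put the highest reward on the single "off" outcome favor shutting down; the abstract's claim is that structurally similar counting arguments extend to a much broader class of decision-making functions than optimal planning.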