We do not know how to align a very intelligent AI agent's behavior with human interests. I investigate whether -- absent a full solution to this AI alignment problem -- we can build smart AI agents which have limited impact on the world and which do not autonomously seek power. In this thesis, I introduce the attainable utility preservation (AUP) method. I demonstrate that AUP produces conservative, option-preserving behavior within toy gridworlds and within complex environments based on Conway's Game of Life. I formalize the problem of side effect avoidance, which provides a way to quantify the side effects an agent has on the world. I also give a formal definition of power-seeking in the context of AI agents and show that optimal policies tend to seek power. In particular, most reward functions have optimal policies which avoid deactivation. This is a problem if we want to deactivate or correct an intelligent agent after we have deployed it. My theorems suggest that since most agent goals conflict with ours, the agent would very probably resist correction. I extend these theorems to show that power-seeking incentives arise not just for optimal decision-makers, but under a wide range of decision-making procedures.
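For intuition, the core of AUP can be sketched as a penalized reward: the agent is charged for changing how well it could pursue a set of auxiliary reward functions, relative to doing nothing. The following is a minimal sketch in illustrative notation; the auxiliary set $\mathcal{R}_{\text{aux}}$, the trade-off weight $\lambda$, and the no-op action $\varnothing$ are placeholder symbols, not the thesis's exact formulation:

\[
  R_{\text{AUP}}(s, a) \;=\; R(s, a) \;-\; \frac{\lambda}{|\mathcal{R}_{\text{aux}}|}
    \sum_{R_i \in \mathcal{R}_{\text{aux}}}
    \bigl| Q^{*}_{R_i}(s, a) - Q^{*}_{R_i}(s, \varnothing) \bigr|
\]

Roughly, the penalty term measures how much the action $a$ shifts the agent's attainable utility for the auxiliary goals compared with the no-op baseline, which is what drives the conservative, option-preserving behavior described above.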