Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case for expecting that, without substantial effort to prevent it, AGIs could learn to pursue goals which are undesirable (i.e. misaligned) from a human perspective. We argue that if AGIs are trained in ways similar to today's most capable models, they could learn to act deceptively to receive higher reward, learn internally-represented goals which generalize beyond their training distributions, and pursue those goals using power-seeking strategies. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing this outcome.