Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case for expecting that, without substantial effort to prevent it, AGIs could learn to pursue goals that are undesirable (in other words, misaligned) from a human perspective. We argue that AGIs trained in ways similar to today's most capable models could learn to act deceptively to receive higher reward; learn internally represented goals which generalize beyond their training distributions; and pursue those goals using power-seeking strategies. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing these problems.