Value alignment problems arise when the objectives specified for an AI agent do not match the true underlying objectives of its users. The problem has been widely argued to be one of the central safety problems in AI. Unfortunately, most existing works on value alignment tend to focus on issues stemming primarily from the fact that reward functions are an unintuitive mechanism for specifying objectives. However, the complexity of the objective specification mechanism is just one of many reasons why a user may misspecify their objective. A foundational cause of misalignment that these works overlook is the inherent asymmetry between the human's expectations about the agent's behavior and the behavior the agent actually generates for the specified objective. To address this lacuna, we propose a novel formulation of the value alignment problem, named goal alignment, that focuses on a few central challenges related to value alignment. In doing so, we bridge the currently disparate research areas of value alignment and human-aware planning. Additionally, we propose a first-of-its-kind interactive algorithm that is capable of using information generated under incorrect beliefs about the agent to determine the true underlying goal of the user.