As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important to be able to efficiently evaluate an agent's performance and correctness. In this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values. The goal is to construct a kind of "driver's test" that a human can give to any agent, verifying value alignment via a minimal number of queries. We study alignment verification problems both with idealized humans who have an explicit reward function and with humans who have implicit values. We analyze verification of exact value alignment for rational agents, and we propose and analyze heuristic and approximate value alignment verification tests in a wide range of gridworlds and a continuous autonomous driving domain. Finally, we prove sufficient conditions under which exact and approximate alignment can be verified across an infinite set of test environments with a constant-query-complexity alignment test.