Inferring reward functions from human behavior is at the core of value alignment: aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This raises the question: how accurate do these models need to be for the inferred reward to be accurate? On the one hand, if small errors in the model can lead to catastrophic errors in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if we can guarantee that reward accuracy improves as our models improve, this would demonstrate the value of further work on modeling. We study this question both theoretically and empirically. We show that, unfortunately, it is possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we also identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.
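To make the question concrete, here is a minimal numerical sketch (our own illustration, not the paper's actual setup): a learner that assumes a Boltzmann-rational human infers reward parameters by maximum likelihood, while the actual human's choice logits carry a small bias of magnitude eps. All names (features, theta_true, beta, eps) are hypothetical, and the printed relative reward error is only meant to illustrate how inference error can grow with the error in the assumed human model.

import numpy as np

rng = np.random.default_rng(0)

# Small discrete setting: K candidate actions, each with a d-dimensional feature vector.
K, d = 8, 3
features = rng.normal(size=(K, d))
theta_true = rng.normal(size=d)   # ground-truth reward parameters
beta = 2.0                        # rationality coefficient assumed by the learner

def boltzmann(theta, logit_bias=0.0):
    """Choice distribution of a (possibly biased) Boltzmann-rational human."""
    logits = beta * features @ theta + logit_bias
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def mle_reward(counts, steps=5000, lr=0.1):
    """Max-likelihood reward estimate under the *unbiased* Boltzmann model."""
    theta = np.zeros(d)
    n = counts.sum()
    for _ in range(steps):
        p = boltzmann(theta)
        # Gradient of the log-likelihood: observed vs. expected feature counts.
        grad = beta * (counts @ features - n * (p @ features))
        theta += lr * grad / n
    return theta

for eps in [0.0, 0.1, 0.3, 1.0]:
    bias = eps * rng.normal(size=K)                   # small perturbation of the human's logits
    p_human = boltzmann(theta_true, logit_bias=bias)  # actual (misspecified) behavior
    counts = rng.multinomial(20000, p_human)          # observed demonstrations
    theta_hat = mle_reward(counts)
    err = np.linalg.norm(theta_hat - theta_true) / np.linalg.norm(theta_true)
    print(f"model error eps={eps:4.1f}  relative reward error={err:.3f}")

In this toy example the reward error stays small when the behavioral bias is small, but the bias direction here is random; the adversarial constructions discussed above choose the bias to maximize damage, which is why additional assumptions are needed for a linear bound.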