For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, this method can only tell us about behavior in the evaluation environment, but the reward may incentivize very different behavior in even a slightly different deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without a policy optimization step. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be efficiently approximated and is more robust than baselines to the choice of coverage distribution. Finally, we show that EPIC distance bounds the regret of optimal policies even under different transition dynamics, and we confirm empirically that it predicts policy training success. Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.
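As a concrete illustration of the sample-based approximation mentioned above, the following is a minimal sketch of how the EPIC distance between two reward functions can be estimated: each reward is canonically shaped using Monte Carlo samples from the state and action coverage distributions, and the distance is the Pearson distance between the shaped rewards on a batch of sampled transitions. The function names (`canonicalize`, `epic_distance`) and the vectorized reward-function interface are illustrative assumptions, not the API of the linked repository.

```python
import numpy as np

def canonicalize(reward_fn, s, a, s_next, bg_s, bg_a, gamma):
    """Approximate the canonically shaped reward
        C(R)(s, a, s') = R(s, a, s')
                         + E[gamma * R(s', A, S') - R(s, A, S') - gamma * R(S, A, S')],
    where S, S' ~ D_S and A ~ D_A are sampled independently.

    s, a, s_next: arrays of shape (N,) -- the batch of transitions to evaluate.
    bg_s, bg_a:   arrays of shape (M,) -- independent samples from D_S and D_A.
    reward_fn:    vectorized callable R(s, a, s') -> array of shape (N,) (assumed interface).
    """
    M = len(bg_s)

    def mean_next(x):
        # Estimate E_{A ~ D_A, S' ~ D_S}[R(x, A, S')] for each x in the batch.
        # Pairing bg_a[j] with bg_s[j] is unbiased because the two sets of
        # samples were drawn independently.
        x_rep = np.repeat(x, M)
        a_rep = np.tile(bg_a, len(x))
        sp_rep = np.tile(bg_s, len(x))
        return reward_fn(x_rep, a_rep, sp_rep).reshape(len(x), M).mean(axis=1)

    const = mean_next(bg_s).mean()           # E[R(S, A, S')]
    return (reward_fn(s, a, s_next)
            + gamma * mean_next(s_next)      # E[gamma * R(s', A, S')]
            - mean_next(s)                   # E[R(s, A, S')]
            - gamma * const)                 # E[gamma * R(S, A, S')]

def epic_distance(reward_a, reward_b, s, a, s_next, bg_s, bg_a, gamma=0.99):
    """Pearson distance between the canonically shaped rewards on a batch of
    transitions (s, a, s') drawn from the coverage distribution."""
    ca = canonicalize(reward_a, s, a, s_next, bg_s, bg_a, gamma)
    cb = canonicalize(reward_b, s, a, s_next, bg_s, bg_a, gamma)
    rho = np.corrcoef(ca, cb)[0, 1]
    # max(...) guards against tiny negative values from floating-point error.
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))
```

Because the canonical shaping subtracts out potential-based shaping terms and the Pearson distance is invariant to positive rescaling, this estimate does not change across rewards that induce the same optimal policy, which is the invariance property the abstract refers to.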