In many real-world tasks, it is not possible to procedurally specify an RL agent's reward function. In such cases, a reward function must instead be learned from interacting with and observing humans. However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences. Absent significant advances in reward learning, it is thus important to be able to audit learned reward functions to verify whether they truly capture user preferences. In this paper, we investigate techniques for interpreting learned reward functions. In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions. We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment. We also discover that existing interpretability techniques often attend to irrelevant changes in reward output, suggesting that reward interpretability may need significantly different methods from policy interpretability.
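To make the mention of saliency methods concrete, below is a minimal, hypothetical sketch (not the paper's code) of gradient-based saliency applied to a learned reward model: we take the gradient of the predicted reward with respect to the observation, so that large-magnitude entries mark input features the reward function is most sensitive to. The `RewardNet` architecture, observation dimension, and function names are illustrative assumptions.

```python
# Hedged sketch: gradient saliency for a learned reward function.
# The network and observation shape here are assumptions for illustration.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Toy learned reward model mapping an observation to a scalar reward."""

    def __init__(self, obs_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def saliency(reward_model: nn.Module, obs: torch.Tensor) -> torch.Tensor:
    """Return |d reward / d obs|: per-feature sensitivity of the reward."""
    obs = obs.clone().detach().requires_grad_(True)
    reward = reward_model(obs)
    reward.backward()
    return obs.grad.abs()


if __name__ == "__main__":
    model = RewardNet()
    observation = torch.randn(8)
    print(saliency(model, observation))  # one sensitivity score per feature
```

In this kind of analysis, a saliency map that concentrates on features irrelevant to the user's true preferences can flag a reward model that has latched onto contingent aspects of the environment.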