The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world. However, comparing reward functions, for example as a means of evaluating reward learning methods, presents a challenge. Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it. To address this challenge, Gleave et al. (2020) propose the Equivalent-Policy Invariant Comparison (EPIC) distance. EPIC avoids policy optimization, but in doing so requires computing reward values at transitions that may be impossible under the system dynamics. This is problematic for learned reward functions because it entails evaluating them outside of their training distribution, resulting in inaccurate reward values that we show can render EPIC ineffective at comparing rewards. To address this problem, we propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric. DARD uses an approximate transition model of the environment to transform reward functions into a form that allows for comparisons that are invariant to reward shaping while only evaluating reward functions on transitions close to their training distribution. Experiments in simulated physical domains demonstrate that DARD enables reliable reward comparisons without policy optimization and is significantly more predictive of downstream policy performance than baseline methods when dealing with learned reward functions.
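As a rough illustration of the comparison step shared by EPIC and DARD, the sketch below computes the Pearson-based distance between two reward functions evaluated on a common batch of transitions. The canonicalization (shaping-removal) transform that both methods apply before this step is omitted here, and the names used (pearson_distance, reward_a, reward_b, transitions) are hypothetical rather than taken from either paper.

```python
import numpy as np

def pearson_distance(r1_values: np.ndarray, r2_values: np.ndarray) -> float:
    """Distance in [0, 1] derived from the Pearson correlation of two reward
    functions evaluated on the same batch of transitions. Both EPIC and DARD
    use a distance of this form after canonicalizing the rewards."""
    rho = np.corrcoef(r1_values, r2_values)[0, 1]
    return float(np.sqrt((1.0 - rho) / 2.0))

# Hypothetical usage: reward_a and reward_b map a batch of (s, a, s') transitions
# to scalar rewards. For learned rewards, `transitions` would be sampled close to
# the rewards' training distribution, which is the constraint DARD is designed
# to respect.
# d = pearson_distance(reward_a(transitions), reward_b(transitions))
```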