Reinforcement learning (RL) requires access to a reward function that incentivizes the right behavior, but such functions are notoriously hard to specify for complex tasks. Preference-based RL provides an alternative: learning policies using a teacher's preferences without pre-defined rewards, thus overcoming concerns associated with reward engineering. However, it is difficult to quantify progress in preference-based RL due to the lack of a commonly adopted benchmark. In this paper, we introduce B-Pref: a benchmark specially designed for preference-based RL. A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly, which makes relying on real human input for evaluation prohibitive. At the same time, simulating human input as giving perfect preferences for the ground-truth reward function is unrealistic. B-Pref alleviates this by simulating teachers with a wide array of irrationalities, and proposes metrics not solely for performance but also for robustness to these potential irrationalities. We showcase the utility of B-Pref by using it to analyze algorithmic design choices, such as selecting informative queries, for state-of-the-art preference-based RL algorithms. We hope that B-Pref can serve as a common starting point to study preference-based RL more systematically. Source code is available at https://github.com/rll-research/B-Pref.
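To make the idea of a simulated teacher with irrationalities concrete, below is a minimal Python sketch of how such a teacher could label a preference query over two trajectory segments using the ground-truth reward. It is an illustrative approximation, not the benchmark's exact implementation; the function name and the parameters (beta, gamma, eps, skip_thresh, equal_thresh) are assumptions chosen to reflect the kinds of irrationalities described above (stochasticity, myopia, mistakes, skipping, and ties).

```python
import numpy as np

def simulated_teacher_preference(
    rewards_0,          # ground-truth rewards along segment 0
    rewards_1,          # ground-truth rewards along segment 1
    beta=1.0,           # rationality: larger beta approaches a perfect teacher
    gamma=1.0,          # myopic discount; gamma < 1 weights recent steps more
    eps=0.0,            # probability of flipping the label (a "mistake")
    skip_thresh=0.0,    # skip the query if neither segment looks rewarding
    equal_thresh=0.0,   # declare a tie when returns are nearly equal
    rng=None,
):
    """Return 1 if segment 1 is preferred, 0 if segment 0 is preferred,
    0.5 for an 'equally preferable' label, or None to skip the query."""
    rng = np.random.default_rng() if rng is None else rng
    rewards_0, rewards_1 = np.asarray(rewards_0), np.asarray(rewards_1)

    # Myopic teacher: weight later timesteps more heavily via gamma^(H - t).
    H = len(rewards_0)
    weights = gamma ** np.arange(H - 1, -1, -1)
    ret_0, ret_1 = np.dot(weights, rewards_0), np.dot(weights, rewards_1)

    # Skip queries where neither segment exceeds the skip threshold.
    if max(ret_0, ret_1) < skip_thresh:
        return None
    # Declare a tie when the two segments are nearly indistinguishable.
    if abs(ret_0 - ret_1) < equal_thresh:
        return 0.5

    # Stochastic (Boltzmann-rational) choice between the two segments.
    p1 = 1.0 / (1.0 + np.exp(-beta * (ret_1 - ret_0)))
    label = int(rng.random() < p1)

    # With probability eps the teacher makes a mistake and flips the label.
    if rng.random() < eps:
        label = 1 - label
    return label
```

Setting beta to a large value with gamma = 1, eps = 0, and zero thresholds recovers an (approximately) oracle teacher that always prefers the segment with higher ground-truth return; varying these parameters yields the spectrum of irrational teachers that a robust preference-based RL algorithm should tolerate.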