Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but it has been shown anecdotally to be prone to spurious correlations and reward hacking. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we present a systematic study of causal confusion and reward misidentification when learning from preferences. In particular, we perform a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states, resulting in poor policy performance when the learned reward is optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification. We also identify a set of methods for interpreting misidentified learned rewards. In general, we observe that optimizing a misidentified reward drives the policy off the reward's training distribution, yielding high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of preference learning to reward misidentification and causal confusion -- failing to account for even one of many factors can result in unexpected, undesirable behavior.
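For concreteness, below is a minimal sketch of how a reward is typically learned from pairwise trajectory preferences, assuming the standard Bradley-Terry comparison formulation; the names (`RewardNet`, `preference_loss`) and network architecture are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of preference-based reward learning (Bradley-Terry style).
# Assumed/illustrative: RewardNet architecture, preference_loss helper.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a single observation (feature vector) to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def preference_loss(reward_net, traj_a, traj_b, pref_b):
    """Bradley-Terry cross-entropy: P(b preferred over a) = sigmoid(R(b) - R(a)),
    where R(.) sums predicted per-state rewards over a trajectory."""
    r_a = reward_net(traj_a).sum(dim=-1)  # (batch,) total predicted reward of trajectory a
    r_b = reward_net(traj_b).sum(dim=-1)  # (batch,) total predicted reward of trajectory b
    return nn.functional.binary_cross_entropy_with_logits(r_b - r_a, pref_b)

# Usage: trajectories as (batch, T, obs_dim) tensors; pref_b = 1.0 means b is preferred.
obs_dim = 8
net = RewardNet(obs_dim)
traj_a, traj_b = torch.randn(4, 20, obs_dim), torch.randn(4, 20, obs_dim)
pref_b = torch.ones(4)
loss = preference_loss(net, traj_a, traj_b, pref_b)
loss.backward()
```

A reward model trained this way can fit the preference labels closely yet latch onto spurious features, which is the failure mode the analyses above probe.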