Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but has been shown anecdotally to be prone to spurious correlations and reward hacking behaviors. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we focus on a systematic study of causal confusion and reward misidentification when learning from preferences. In particular, we perform a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states -- resulting in poor policy performance when optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification. We also identify a set of methods with which to interpret misidentified learned rewards. In general, we observe that optimizing misidentified rewards drives the policy off the reward's training distribution, resulting in high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of preference learning to reward misidentification and causal confusion -- failure to consider even one of many factors can result in unexpected, undesirable behavior.
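For context, the reward models studied here are trained from pairwise preferences over trajectory segments, typically with a Bradley-Terry likelihood over predicted segment returns. The sketch below illustrates that standard setup; the network architecture, segment format, and synthetic data are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of preference-based reward learning with a Bradley-Terry
# model. Architecture, hyperparameters, and the synthetic data are
# illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a per-step observation feature vector to a scalar reward."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(reward_net, seg_a, seg_b, prefs):
    """Bradley-Terry loss: P(a preferred over b) is proportional to
    exp(R_a), where R is the predicted return (sum of per-step rewards)."""
    ret_a = reward_net(seg_a).sum(dim=1)              # (batch,) return of segment a
    ret_b = reward_net(seg_b).sum(dim=1)              # (batch,) return of segment b
    logits = torch.stack([ret_a, ret_b], dim=1)       # (batch, 2)
    return nn.functional.cross_entropy(logits, prefs) # prefs: 0 if a preferred, 1 if b

# Synthetic example: batches of 50-step segments with 10-dim observations.
obs_dim, seg_len, batch = 10, 50, 32
reward_net = RewardNet(obs_dim)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

for step in range(100):
    seg_a = torch.randn(batch, seg_len, obs_dim)
    seg_b = torch.randn(batch, seg_len, obs_dim)
    prefs = torch.randint(0, 2, (batch,))  # stated (possibly noisy) preference labels
    loss = preference_loss(reward_net, seg_a, seg_b, prefs)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A reward model trained this way can reach low held-out preference-classification error while still keying on non-causal distractor features, which is the failure mode analyzed in this work.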