Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but has been shown anecdotally to be prone to spurious correlations and reward hacking behaviors. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we aim to study it in the context of reward learning. To study causal confusion, we perform a series of sensitivity and ablation analyses on three benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states -- resulting in poor policy performance when optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, partial state observability, and larger model capacity can all exacerbate causal confusion. We also identify a set of methods with which to interpret causally confused learned rewards: we observe that optimizing causally confused rewards drives the policy off the reward's training distribution, resulting in high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of reward learning to causal confusion, especially in high-dimensional environments -- failure to consider even one of many factors (data coverage, state definition, etc.) can quickly result in unexpected, undesirable behavior.
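For context, preference-based reward learning of the kind studied here is commonly instantiated with a Bradley-Terry style model over trajectory pairs. The following is a minimal illustrative sketch (not the paper's implementation), assuming a small PyTorch MLP reward over states and synthetic trajectory pairs with placeholder dimensions:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a reward model trained from pairwise trajectory
# preferences with the standard Bradley-Terry objective. State dimension,
# network width, and data shapes are placeholder assumptions, not the paper's.

class RewardNet(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # Sum per-state rewards over the time dimension of each trajectory.
        return self.net(states).sum(dim=-2).squeeze(-1)

def preference_loss(model, traj_a, traj_b, prefs):
    """Bradley-Terry loss: prefs[i] = 1 if traj_a[i] is preferred, else 0."""
    r_a, r_b = model(traj_a), model(traj_b)
    # P(a preferred over b) = sigmoid(R(a) - R(b))
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, prefs)

if __name__ == "__main__":
    # Toy data: 32 trajectory pairs of length 20 in a 10-dimensional state space.
    model = RewardNet(state_dim=10)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    traj_a, traj_b = torch.randn(32, 20, 10), torch.randn(32, 20, 10)
    prefs = torch.randint(0, 2, (32,)).float()
    for _ in range(100):
        opt.zero_grad()
        loss = preference_loss(model, traj_a, traj_b, prefs)
        loss.backward()
        opt.step()
```

A learned reward of this form can fit the preference data with low test error yet latch onto non-causal features of the state, which is the failure mode the analyses above probe.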