The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is evaluated on a variety of tasks in DMControl and Meta-World, where it shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
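To make the two stated ideas concrete, below is a minimal illustrative sketch, not the authors' implementation: it assumes a VAE-style reward encoder whose latent posterior is penalized toward a standard-normal prior (the "close to the prior distribution" constraint), a Bradley-Terry preference loss over trajectory segments, and an inverse-variance confidence weighting for ensembling reward models. All module names, network sizes, and the specific confidence rule are assumptions for exposition.

```python
# Minimal sketch (not the paper's released code) of a KL-regularized latent reward
# model and confidence-weighted reward ensembling. Shapes, sizes, and the
# confidence rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentRewardModel(nn.Module):
    """Encodes (state, action) into a Gaussian latent, decodes a scalar reward."""

    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),            # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),                          # scalar reward prediction
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor):
        mu, log_var = self.encoder(torch.cat([obs, act], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
        reward = self.decoder(z).squeeze(-1)
        # KL(q(z|s,a) || N(0, I)) pulls the latent space toward the prior,
        # which is the regularization the abstract refers to.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)
        return reward, kl


def preference_loss(model: LatentRewardModel, seg_a, seg_b, label, beta: float = 0.1):
    """Bradley-Terry preference loss on two trajectory segments plus KL penalty.

    seg_a / seg_b: (obs, act) tensors of shape (T, dim); label is a scalar
    long tensor in {0, 1} indicating which segment the annotator preferred.
    """
    r_a, kl_a = model(*seg_a)
    r_b, kl_b = model(*seg_b)
    logits = torch.stack([r_a.sum(), r_b.sum()])        # segment returns
    ce = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
    return ce + beta * (kl_a.mean() + kl_b.mean())


def ensemble_reward(models, obs, act):
    """Confidence-weighted ensemble prediction (one plausible scheme):
    weight each member by the inverse variance of its latent posterior."""
    rewards, confidences = [], []
    for m in models:
        mu, log_var = m.encoder(torch.cat([obs, act], dim=-1)).chunk(2, dim=-1)
        rewards.append(m.decoder(mu).squeeze(-1))       # decode from the mean latent
        confidences.append((-log_var).exp().mean(-1))   # higher = more confident
    rewards = torch.stack(rewards)                      # (n_models, batch)
    weights = torch.softmax(torch.stack(confidences), dim=0)
    return (weights * rewards).sum(0)
```

In this sketch the confidence signal is the encoder's posterior precision; the paper's actual correction and ensembling rules may differ, and the sketch is meant only to show how latent-space regularization and confidence-based ensembling fit together in a preference-based reward learner.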