The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments. These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling preferences instead as arising from a different statistic: each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences. We also prove that the previous partial return model lacks this identifiability property without preference noise that reveals rewards' relative proportions, and we empirically show that our proposed regret preference model outperforms it with finite training data in otherwise the same setting. Additionally, our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research.
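To make the two preference statistics concrete, below is a minimal sketch contrasting the partial-return and regret preference models under a logistic (Bradley-Terry-style) choice rule. The function names, the rationality coefficient `beta`, the toy optimal state values `v_star`, and the specific regret computation (optimal value at the segment's start minus the segment's discounted return plus the discounted optimal value at its end, assuming deterministic transitions) are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Sketch of the two preference statistics: partial return vs. regret.
from dataclasses import dataclass
from typing import Dict, List
import math


@dataclass
class Step:
    state: int
    action: int
    reward: float


def partial_return(segment: List[Step], gamma: float = 1.0) -> float:
    """Sum of (discounted) rewards along the segment."""
    return sum(gamma**t * step.reward for t, step in enumerate(segment))


def regret(segment: List[Step], v_star: Dict[int, float], gamma: float = 1.0) -> float:
    """Deviation from optimal decision-making: the optimal value at the segment's
    start state, minus the segment's discounted return plus the discounted optimal
    value of its final state. Assumes access to V* and deterministic transitions;
    a full implementation would use the state reached after the final action."""
    ret = partial_return(segment, gamma)
    start_value = v_star[segment[0].state]
    end_value = v_star[segment[-1].state]
    return start_value - (ret + gamma ** len(segment) * end_value)


def preference_prob(stat_1: float, stat_2: float, beta: float = 1.0) -> float:
    """P(segment 1 preferred over segment 2) under a logistic model on the statistic."""
    return 1.0 / (1.0 + math.exp(-beta * (stat_1 - stat_2)))


# Partial-return model: prefer the segment with the higher summed reward.
# Regret model: prefer the segment with the *lower* regret (higher negated regret).
seg_a = [Step(0, 1, 0.0), Step(1, 0, 1.0)]
seg_b = [Step(0, 0, 0.5), Step(2, 1, 0.0)]
v_star = {0: 1.0, 1: 0.0, 2: 0.2}  # toy optimal state values, assumed known here

p_partial = preference_prob(partial_return(seg_a), partial_return(seg_b))
p_regret = preference_prob(-regret(seg_a, v_star), -regret(seg_b, v_star))
print(f"P(A>B | partial return) = {p_partial:.3f}, P(A>B | regret) = {p_regret:.3f}")
```

Note that the regret statistic requires (an estimate of) the optimal value function for the reward function being learned, whereas partial return depends only on the rewards observed within each segment; this is the key structural difference between the two models.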