Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecification. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset. The code is available at https://github.com/VRPO/VRPO.
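For reference, the Bradley-Terry model mentioned above is the standard preference model used for reward learning in RLHF; a common formulation (generic notation, not necessarily this paper's) is

\[
\Pr(y_w \succ y_l \mid x) \;=\; \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr) \;=\; \frac{\exp r_\theta(x, y_w)}{\exp r_\theta(x, y_w) + \exp r_\theta(x, y_l)},
\]

where $x$ is the prompt, $y_w$ and $y_l$ are the preferred and dispreferred responses, $r_\theta$ is the learned reward, and $\sigma$ is the logistic function. The misspecification concern is that observed human judgments need not follow this parametric form.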