Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and these models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). PPO, however, is sensitive to hyperparameters and requires at least four models in its standard implementation, making it hard to train. In contrast, we propose a novel learning paradigm called RRHF, which scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss. RRHF can efficiently align language model output probabilities with human preferences as robustly as fine-tuning, while requiring only 1 to 2 models during tuning. In addition, RRHF can be considered an extension of SFT and reward model training, while being simpler than PPO in terms of coding, model counts, and hyperparameters. The entire alignment process can be accomplished within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca on Helpful and Harmless data, demonstrating performance comparable to PPO.
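The ranking-based objective described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function names `sequence_logprobs` and `rrhf_style_loss`, the margin-free hinge form of the pairwise term, and the single SFT-style term on the best-rewarded response are assumptions made for the sketch.

```python
# A rough sketch of a ranking-loss objective in the spirit of RRHF (assumed, not
# the paper's exact code): candidate responses from different sampling policies
# are scored externally (e.g., by a reward model), and the policy's
# length-normalized conditional log-probabilities are trained to follow the
# same ordering, with an SFT-style term on the best-scored response.
import torch
import torch.nn.functional as F


def sequence_logprobs(logits: torch.Tensor, labels: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Length-normalized log-probability of each candidate response.

    logits: (num_candidates, seq_len, vocab_size) from the policy model
    labels: (num_candidates, seq_len) response token ids, padded with pad_id
            (pad_id is assumed to be a valid vocabulary index)
    """
    logprobs = F.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (n, seq_len)
    mask = (labels != pad_id).float()
    return (token_lp * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)    # (n,)


def rrhf_style_loss(seq_lp: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Ranking loss plus an SFT-style term (assumed combination).

    seq_lp:  (n,) length-normalized log-probs of the candidates
    rewards: (n,) scores of the candidates from the scoring function
    """
    # diff_lp[i, j] = seq_lp[j] - seq_lp[i]; worse[i, j] is True when
    # rewards[j] < rewards[i], i.e. candidate j is ranked below candidate i.
    diff_lp = seq_lp.unsqueeze(0) - seq_lp.unsqueeze(1)
    worse = rewards.unsqueeze(0) < rewards.unsqueeze(1)
    # Penalize pairs where the lower-rewarded response gets higher probability.
    rank_loss = torch.clamp(diff_lp, min=0.0)[worse].sum()
    # SFT-style term: push up the log-probability of the best-rewarded response.
    sft_loss = -seq_lp[rewards.argmax()]
    return rank_loss + sft_loss
```

Because the objective only needs the policy's log-probabilities and externally provided scores, training involves just the policy model, plus optionally a reward model for scoring, which is consistent with the 1-to-2-model claim above.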