Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially significant repercussions. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, wherein generative models are fine-tuned using RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment of generative models, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models more effectively. Utilizing a reward model and a sufficient number of samples, our approach selects high-quality samples, discarding those that exhibit undesired behavior, and subsequently assembles a streaming dataset. This dataset serves as the basis for aligning the generative model and can be employed in both offline and online settings. Notably, the sample generation process within RAFT is gradient-free, rendering it compatible with black-box generators. Through extensive experiments, we demonstrate that our proposed algorithm exhibits strong performance on both large language models and diffusion models.
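To make the reward-ranked selection step concrete, the following is a minimal sketch of one RAFT-style iteration. The `generate`, `reward`, and `fine_tune` callables are hypothetical placeholders standing in for the black-box sampler, the trained reward model, and the supervised fine-tuning step described in the abstract; the toy stand-ins in the usage example are illustrative only and are not the paper's actual models or training code.

```python
# A minimal sketch of one RAFT-style iteration under stated assumptions:
# sample k candidates per prompt, keep the highest-reward one, and
# fine-tune the generator on the filtered batch.
import random
from typing import Callable, List, Tuple


def raft_iteration(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],    # black-box sampler: k candidates per prompt
    reward: Callable[[str, str], float],          # reward model score for (prompt, response)
    fine_tune: Callable[[List[Tuple[str, str]]], None],  # SFT step on the selected pairs
    k: int = 8,
) -> List[Tuple[str, str]]:
    """Reward-ranked selection followed by fine-tuning on the kept samples."""
    selected = []
    for prompt in prompts:
        candidates = generate(prompt, k)          # gradient-free: works with black-box generators
        best = max(candidates, key=lambda c: reward(prompt, c))  # keep the top-ranked sample
        selected.append((prompt, best))
    fine_tune(selected)                           # align the generator on the filtered dataset
    return selected


# Toy usage with hypothetical stand-ins (not the actual models).
if __name__ == "__main__":
    gen = lambda p, k: [f"{p} -> draft {i} ({random.random():.2f})" for i in range(k)]
    rew = lambda p, r: float(r.split("(")[-1].rstrip(")"))  # pretend the trailing number is the reward
    ft = lambda batch: print(f"fine-tuning on {len(batch)} selected samples")
    raft_iteration(["Summarize the report.", "Write a polite reply."], gen, rew, ft, k=4)
```

In an online setting, this iteration would be repeated with fresh samples drawn from the updated generator, yielding the streaming dataset mentioned above.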