Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning a distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
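To make the conditional training objective concrete, the following is a minimal sketch of how pretraining text might be annotated with control tokens based on reward-model scores. The names `reward_score`, `annotate`, `GOOD_TOKEN`, `BAD_TOKEN`, and the threshold are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of conditional training data preparation: prepend a control
# token to each text segment according to a reward-model score, then pretrain
# with the usual LM objective so the model learns a distribution over tokens
# conditioned on the control token. Hypothetical names throughout.
from typing import List

GOOD_TOKEN = "<|good|>"  # marks segments the reward model prefers
BAD_TOKEN = "<|bad|>"    # marks segments the reward model disprefers


def reward_score(segment: str) -> float:
    # Stand-in for a learned reward model returning a human-preference score;
    # a trivial heuristic here so the sketch runs end to end.
    return -1.0 if "password" in segment.lower() else 1.0


def annotate(segments: List[str], threshold: float = 0.0) -> List[str]:
    """Prepend GOOD_TOKEN or BAD_TOKEN to each segment based on its score.

    At inference time, prompting the pretrained LM with GOOD_TOKEN steers
    generation toward preferred content.
    """
    return [
        (GOOD_TOKEN if reward_score(seg) >= threshold else BAD_TOKEN) + seg
        for seg in segments
    ]


if __name__ == "__main__":
    docs = ["Here is a helpful explanation.", "My password is hunter2."]
    print(annotate(docs))
```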