Pretrained language models often fail to perform tasks in ways that align with our preferences, e.g., they generate offensive text or factually incorrect summaries. Recent work addresses this issue by learning from a simple form of human evaluation: comparisons between pairs of model-generated task outputs. However, comparison feedback conveys limited information about human preferences per human evaluation. Here, we propose to learn from natural language feedback, which conveys more information per human evaluation. We learn from language feedback on model outputs using a three-step learning algorithm. First, we condition the language model on the initial output and feedback to generate many refinements. Second, we choose the refinement with the highest similarity to the feedback. Third, we finetune a language model to maximize the likelihood of the chosen refinement given the input. In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements, finding that only large language models (175B parameters) do so. Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization.
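The three steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `select_refinement` and `build_finetuning_pairs`, the `generate` callable, and the bag-of-words cosine similarity are all assumptions chosen for illustration, and the abstract does not specify which similarity measure is used. The sketch covers steps one and two and prepares the (input, refinement) pairs that step three would finetune on; the finetuning itself is not shown.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def cosine_similarity(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (an assumption; any text
    similarity measure could be substituted here)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_refinement(
    feedback: str,
    refinements: List[str],
    similarity: Callable[[str, str], float] = cosine_similarity,
) -> str:
    """Step 2: choose the refinement most similar to the feedback."""
    if not refinements:
        raise ValueError("need at least one candidate refinement")
    return max(refinements, key=lambda r: similarity(r, feedback))


def build_finetuning_pairs(
    examples: List[Tuple[str, str, str]],
    generate: Callable[[str, str, str, int], List[str]],
    num_candidates: int = 4,
) -> List[Tuple[str, str]]:
    """Steps 1 and 2 for a batch of (input, initial_output, feedback) triples.

    `generate` is a hypothetical stand-in for sampling refinements from a
    language model conditioned on the initial output and the feedback.
    Returns (input, chosen_refinement) pairs for step 3 (finetuning).
    """
    pairs = []
    for task_input, initial_output, feedback in examples:
        candidates = generate(task_input, initial_output, feedback, num_candidates)
        pairs.append((task_input, select_refinement(feedback, candidates)))
    return pairs


if __name__ == "__main__":
    # Stub generator used only for demonstration; a real system would
    # sample these candidates from a language model.
    def stub_generate(task_input, initial_output, feedback, n):
        return [
            initial_output,
            "The team met on Monday and discussed the budget.",
        ][:n]

    examples = [(
        "Meeting notes: the team met on Monday to review the budget.",
        "The team met on Monday.",
        "The summary should mention the budget.",
    )]
    print(build_finetuning_pairs(examples, stub_generate))
```

Passing the generator and the similarity function in as arguments keeps the selection logic independent of any particular model or embedding API, so either can be swapped without touching the rest of the sketch.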