Pretrained language models often generate outputs that are misaligned with human preferences, such as harmful text or factually incorrect summaries. Recent work addresses these issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback conveys only limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and the feedback to generate refinements; second, selecting the refinement that incorporates the most feedback; third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian inference, similar to Reinforcement Learning from Human Feedback (RLHF). We evaluate ILF's effectiveness on a carefully controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from either alone, achieving human-level summarization performance.
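To make the three-step loop concrete, below is a minimal Python sketch of a single ILF iteration. The names `ilf_iteration`, `generate`, `score_incorporation`, and `finetune` are illustrative assumptions rather than the paper's actual implementation: the callables stand in for sampling refinements from the language model, for the refinement-selection step, and for standard supervised finetuning.

```python
from typing import Callable, List

# Simple type aliases for readability (hypothetical, not from the paper).
Prompt = str
Output = str
Feedback = str


def ilf_iteration(
    inputs: List[Prompt],
    initial_outputs: List[Output],
    feedback: List[Feedback],
    generate: Callable[[str, int], List[Output]],                      # sample n refinements from the LM
    score_incorporation: Callable[[Prompt, Feedback, Output], float],  # how well a refinement uses the feedback
    finetune: Callable[[List[Prompt], List[Output]], None],            # supervised finetuning on (input, target) pairs
    n_refinements: int = 4,
) -> None:
    """One pass of the three ILF steps over a batch of (input, output, feedback) triples."""
    chosen: List[Output] = []
    for x, y0, f in zip(inputs, initial_outputs, feedback):
        # Step 1: condition the LM on the input, its initial output, and the
        # language feedback, and sample several candidate refinements.
        refine_prompt = f"Input: {x}\nInitial output: {y0}\nFeedback: {f}\nRefinement:"
        candidates = generate(refine_prompt, n_refinements)

        # Step 2: keep the refinement that incorporates the feedback best
        # (score_incorporation is an abstract stand-in for the selection step).
        chosen.append(max(candidates, key=lambda y: score_incorporation(x, f, y)))

    # Step 3: finetune the LM to maximize the likelihood of the chosen
    # refinements given the original inputs.
    finetune(inputs, chosen)
```

Repeating this procedure, with fresh feedback collected on the newly finetuned model's outputs at each round, yields the iterative loop described in the abstract.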