As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about: summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
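The reward-model step described above (predicting which of two summaries a human prefers) is typically trained with a pairwise comparison loss: the model assigns a scalar reward to each summary, and the loss penalizes ranking the rejected summary above the preferred one. A minimal sketch of that loss, with illustrative names rather than the paper's actual code:

```python
import math

def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the reward model ranks the
    human-preferred summary above the rejected one.

    This is a Bradley-Terry style pairwise loss: the probability that
    the preferred summary wins is sigmoid(r_preferred - r_rejected).
    """
    win_probability = 1.0 / (1.0 + math.exp(-(r_preferred - r_rejected)))
    return -math.log(win_probability)

# When the model scores the preferred summary higher, the loss is small;
# when it scores it lower, the loss grows.
confident = pairwise_preference_loss(2.0, -1.0)  # preferred scored higher
mistaken = pairwise_preference_loss(-1.0, 2.0)   # preferred scored lower
assert confident < mistaken
```

Minimizing this loss over a dataset of human comparisons yields a scalar reward function, which can then drive reinforcement-learning fine-tuning of the summarization policy.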