We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. We study scaling properties of critiquing with both topic-based summarization and synthetic tasks. Larger models write more helpful critiques and, on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. Larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. Finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. Our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. These results are a proof of concept for using AI-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. We release our training datasets, as well as samples from our critique assistance experiments.