We present Second Thought, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain of edits between value-unaligned and value-aligned text, first through LM fine-tuning and then through additional refinement with reinforcement learning, Second Thought not only achieves superior performance on three value alignment benchmark datasets but also shows strong transfer of human values in few-shot scenarios. The generated editing steps also offer better interpretability and facilitate interactive error correction. Extensive human evaluations further confirm its effectiveness.