We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals a different facet of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.