A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Jared Kaplan

From arXiv. 26+19 pages; v2: typos fixed, refs added, figure scale/colors fixed; v3: corrected very non-standard TruthfulQA formatting and metric, alignment implications slightly improved

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction, we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next, we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally, we study a "preference model pre-training" stage of training, with the goal of improving sample efficiency when finetuning on human preferences.
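To make the contrast between binary discrimination and ranked preference modeling concrete, here is a minimal sketch. It assumes a model assigns one scalar score to each sample. The function names are hypothetical, and the losses are common textbook forms chosen for illustration, not taken from this abstract. The binary objective judges a single sample against a fixed label, while the ranked objective only looks at the score difference within a pair.

```python
import math


def binary_discrimination_loss(score: float, label: int) -> float:
    """Binary cross-entropy on one sample (illustrative assumption).

    `score` is a raw scalar; `label` is 1 for "good" and 0 for "bad".
    """
    p = 1.0 / (1.0 + math.exp(-score))  # sigmoid turns the score into a probability
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def ranked_preference_loss(score_better: float, score_worse: float) -> float:
    """Pairwise ranking loss log(1 + exp(score_worse - score_better)).

    Depends only on the score gap between the two samples of a pair
    (illustrative assumption, not numerically hardened for large gaps).
    """
    return math.log1p(math.exp(score_worse - score_better))


if __name__ == "__main__":
    # Correctly ordered pair gives a smaller loss than a mis-ordered one.
    print(ranked_preference_loss(2.0, 0.0) < ranked_preference_loss(0.0, 2.0))
    # A confident, correct binary prediction gives a small loss.
    print(binary_discrimination_loss(3.0, 1) < binary_discrimination_loss(-3.0, 1))
```

With tied scores both losses equal log 2, and the ranked loss shrinks as the preferred sample's score pulls ahead of the other.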
