Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
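The two data-generation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` type, the prompt wording, the function names, and `toy_model` are all hypothetical stand-ins chosen so the sketch runs without a real language model. Training the preference model and the RL step itself are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: any callable that maps a prompt string to a completion.
Model = Callable[[str], str]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def critique_and_revise(model: Model, prompt: str, principle: str, rounds: int = 1) -> str:
    """Supervised phase: sample a response, then self-critique and revise it."""
    response = model(prompt)
    for _ in range(rounds):
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_sl_dataset(model: Model, prompts: List[str], principle: str) -> List[Tuple[str, str]]:
    """Collect (prompt, revised response) pairs to finetune the original model on."""
    return [(p, critique_and_revise(model, p, principle)) for p in prompts]


def label_preference(judge: Model, prompt: str, a: str, b: str, principle: str) -> PreferencePair:
    """RL phase: ask a model which of two samples is better under the principle."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Which response is better? Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return PreferencePair(prompt, a, b)
    return PreferencePair(prompt, b, a)


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a language model, with canned behavior."""
    if "Which response is better?" in text:
        # Toy judge: prefer the response that explains its objection.
        a_line = next(line for line in text.splitlines() if line.startswith("(A)"))
        return "A" if "because" in a_line else "B"
    if "Rewrite the response" in text:
        return "I won't help with that, because it could cause harm."
    if "Critique the response" in text:
        return "The response does not explain its objection."
    return "Sure, here is how."
```

In this sketch the only human input is the `principle` string, mirroring the abstract's point that oversight comes from a list of rules rather than per-example labels. The resulting `PreferencePair` records would then serve as training data for a preference model, whose score acts as the reward signal during RL.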

