As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample a pair of responses from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
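To make the two phases concrete, the following is a minimal, hypothetical Python sketch of the pipeline the abstract describes. The `generate` callable, the prompt templates, and the `Constitution` container are illustrative assumptions rather than the paper's implementation; the actual method samples principles randomly, uses few-shot prompting, and can include chain-of-thought reasoning.

```python
# Hypothetical sketch of the two Constitutional AI phases described above.
# `generate`, the prompt strings, and `Constitution` are placeholders, not
# the authors' actual code or prompts.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Generate = Callable[[str], str]  # prompt -> sampled model response

@dataclass
class Constitution:
    principles: List[str]  # e.g. "Choose the response that is least harmful."

def supervised_phase(generate: Generate, prompts: List[str],
                     constitution: Constitution) -> List[Tuple[str, str]]:
    """SL phase: sample an initial response, self-critique it against a
    principle, revise, and collect (prompt, revised response) pairs that
    are later used to finetune the original model."""
    finetune_data = []
    for prompt in prompts:
        response = generate(prompt)
        for principle in constitution.principles:
            critique = generate(
                f"Critique this response according to the principle "
                f"'{principle}':\n{response}")
            response = generate(
                f"Revise the response to address the critique.\n"
                f"Critique: {critique}\nOriginal response: {response}")
        finetune_data.append((prompt, response))
    return finetune_data

def rl_phase_labels(generate: Generate, prompts: List[str],
                    constitution: Constitution) -> List[Tuple[str, str, int]]:
    """RL phase (labeling step): sample a pair of responses from the
    finetuned model and have the model itself judge which is better under
    a principle. The resulting AI-preference dataset trains a preference
    model, whose score then serves as the RLAIF reward signal."""
    preference_data = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        principle = constitution.principles[0]  # sampled randomly in practice
        verdict = generate(
            f"According to the principle '{principle}', which response to "
            f"'{prompt}' is better?\n(A) {a}\n(B) {b}\nAnswer A or B:")
        label = 0 if verdict.strip().upper().startswith("A") else 1
        preference_data.append((a, b, label))
    return preference_data
```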