We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.