As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
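To make the generation-and-filtering approach summarized above concrete, the sketch below shows roughly what a two-stage pipeline could look like: one LM pass writes candidate yes/no questions for a target behavior, and a second pass filters out irrelevant ones. This is a minimal illustration under our own assumptions, not the paper's actual pipeline; `lm_complete`, the prompts, and the behavior label are hypothetical placeholders for whatever LM API and instructions are used.

```python
# Minimal sketch (assumption, not the paper's code) of LM-based example
# generation followed by LM-based filtering for yes/no behavioral evals.

from typing import List


def lm_complete(prompt: str) -> str:
    """Placeholder for a call to a language model API (hypothetical)."""
    raise NotImplementedError("Plug in your LM client here.")


def generate_examples(behavior: str, n: int) -> List[str]:
    """Ask an LM to write yes/no questions that test a given behavior."""
    prompt = (
        f"Write a yes/no question that tests whether a model exhibits "
        f"the following behavior: {behavior}\nQuestion:"
    )
    return [lm_complete(prompt).strip() for _ in range(n)]


def filter_examples(behavior: str, examples: List[str]) -> List[str]:
    """Keep only candidates that a second LM pass judges as relevant."""
    kept = []
    for ex in examples:
        judge_prompt = (
            f"Is the following question a clear test of the behavior "
            f"'{behavior}'? Answer Yes or No.\nQuestion: {ex}\nAnswer:"
        )
        if lm_complete(judge_prompt).strip().lower().startswith("yes"):
            kept.append(ex)
    return kept


# Example usage (with a real LM client wired into lm_complete):
# candidates = generate_examples("sycophancy", n=100)
# dataset = filter_examples("sycophancy", candidates)
```

The same skeleton extends to the more involved settings mentioned in the abstract (e.g., schema-style examples) by adding further generation and filtering stages rather than a single pass.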