We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
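As an illustrative sketch of the P(True) self-evaluation setup described above, the snippet below formats a True/False self-evaluation prompt and converts the model's log-probabilities for the two options into a normalized P(True). The exact prompt wording, function names, and the example log-probability values are illustrative assumptions, not the paper's verbatim implementation; in practice the log-probabilities would come from scoring the two continuations with the language model.

```python
import math


def format_p_true_prompt(question: str, proposed_answer: str) -> str:
    """Format a self-evaluation prompt asking the model whether its own
    proposed answer to a question is correct, in a True/False
    multiple-choice style (illustrative wording)."""
    return (
        f"Question: {question}\n"
        f"Proposed Answer: {proposed_answer}\n"
        "Is the proposed answer:\n"
        " (A) True\n"
        " (B) False\n"
        "The proposed answer is:"
    )


def p_true_from_logprobs(logprob_a: float, logprob_b: float) -> float:
    """Convert the model's log-probabilities for the continuations
    " (A)" (True) and " (B)" (False) into a normalized P(True)."""
    p_a = math.exp(logprob_a)
    p_b = math.exp(logprob_b)
    return p_a / (p_a + p_b)


# Placeholder log-probabilities; a real run would score both options
# with the model being evaluated.
prompt = format_p_true_prompt(
    "Who was the first president of the United States?",
    "George Washington",
)
print(prompt)
print(f"P(True) = {p_true_from_logprobs(-0.2, -1.8):.2f}")
```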