Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4\% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
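To make the consistency idea concrete, the sketch below (not the authors' released implementation; it assumes PyTorch and pre-extracted hidden-state activations, and all names and hyperparameters are illustrative) fits a linear probe on unlabeled activations of each statement phrased with answer "Yes" and with answer "No", pushing the two predicted probabilities to sum to one while a confidence term rules out the uninformative 0.5/0.5 solution.

```python
# A minimal sketch, assuming activations have already been extracted from the
# model for each question phrased with answer "Yes" (acts_plus) and answer
# "No" (acts_minus). Names and hyperparameters are illustrative.
import torch


def train_consistency_probe(acts_plus, acts_minus, n_steps=1000, lr=1e-2):
    """acts_plus, acts_minus: float tensors of shape (n_examples, hidden_dim)."""
    # Normalize each set of activations separately so the probe cannot simply
    # read off the surface difference between the "Yes" and "No" phrasings.
    acts_plus = (acts_plus - acts_plus.mean(0)) / (acts_plus.std(0) + 1e-6)
    acts_minus = (acts_minus - acts_minus.mean(0)) / (acts_minus.std(0) + 1e-6)

    probe = torch.nn.Linear(acts_plus.shape[1], 1)  # the "direction" in activation space
    opt = torch.optim.Adam(probe.parameters(), lr=lr)

    for _ in range(n_steps):
        p_plus = torch.sigmoid(probe(acts_plus)).squeeze(-1)
        p_minus = torch.sigmoid(probe(acts_minus)).squeeze(-1)
        # Consistency: a statement and its negation should get opposite truth values.
        consistency = ((p_plus - (1.0 - p_minus)) ** 2).mean()
        # Confidence: discourage the degenerate solution p_plus = p_minus = 0.5.
        confidence = (torch.minimum(p_plus, p_minus) ** 2).mean()
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()

    return probe
```

At test time one could score a question by averaging p_plus and 1 - p_minus and thresholding at 0.5. Because training uses no labels, the learned direction is only determined up to sign, so which side corresponds to "true" has to be resolved separately.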