Large language models generate complex, open-ended outputs: instead of outputting a class label, they write summaries, generate dialogue, or produce working code. To assess the reliability of these open-ended generation systems, we aim to identify qualitative categories of erroneous behavior, beyond identifying individual errors. To hypothesize and test for such qualitative errors, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases as motivation to (i) generate hypotheses for problems that models may have, and (ii) develop experiments that elicit these problems. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to elicit high-impact errors such as incorrectly deleting files. Our results indicate that experimental methodology from cognitive science can help characterize how machine learning systems behave.