Large language models (LLMs) have exploded in popularity in the past few years and have achieved undeniably impressive results on benchmarks as varied as question answering and text summarization. We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result, this time outperforming humans at common sense ethical reasoning (as measured by accuracy on a subset of the ETHICS dataset). Unfortunately, we find that relying on average performance to judge capabilities can be highly misleading. LLM errors differ systematically from human errors in ways that make it easy to craft adversarial examples, or even perturb existing examples to flip the output label. We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions. Our results highlight how human-like performance does not necessarily imply human-like understanding or reasoning.