We introduce a new type of test, called a Turing Experiment (TE), for evaluating how well a language model, such as GPT-3, can simulate different aspects of human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We present TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use by comparing how well different language models reproduce classic experiments in economics, psycholinguistics, and social psychology: the Ultimatum Game, Garden Path Sentences, the Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, recent models replicated the existing findings, while the last TE reveals a "hyper-accuracy distortion" present in some language models.