How should we compare the capabilities of language models and humans? Here, I consider a case study: processing of recursively nested grammatical structures. Prior work has suggested that language models cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training before being evaluated, while the language models were evaluated zero-shot. I therefore attempt to more closely match the evaluation paradigms by providing language models with few-shot prompts. A simple prompt, which contains substantially less content than the human training, allows large language models to consistently outperform the human results. The same prompt even allows extrapolation to more-deeply-nested conditions than have been tested in humans. Further, a reanalysis of the prior human experiments suggests that the humans may not perform above chance at the difficult structures initially. These results suggest that large language models can in fact process recursively nested grammatical structures comparably to humans. This case study highlights how discrepancies in the quantity of experiment-specific context can confound comparisons of language models and humans. I use this case study to reflect on the broader challenge of comparing human and model capabilities, and to suggest that there is an important difference between evaluating cognitive models of a specific phenomenon and evaluating broadly-trained models.