Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given the previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently \emph{worse} than even relatively small language models like GPT3-Ada at next-token prediction.
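As a point of reference, the two metrics compared above follow their standard definitions (a sketch of the usual formulas, with $x_{<i}$ denoting the preceding tokens, $N$ the number of predicted positions, and $p(\cdot \mid x_{<i})$ the predictor's next-token distribution; the exact way human predictions are elicited is described in the experiments themselves):
\[
\text{top-1 accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\hat{x}_i = x_i\right],
\qquad
\hat{x}_i = \arg\max_{v}\, p(v \mid x_{<i}),
\]
\[
\text{perplexity} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right).
\]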