Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting. We ask whether these models exhibit commonsense understanding -- a critical component of NLP applications -- by evaluating models against four commonsense benchmarks. We find that the impressive zero-shot performance of large language models is largely due to the existence of dataset bias in our benchmarks. We also show that zero-shot performance is sensitive to the choice of hyper-parameters and to the similarity of the benchmark to the pre-training datasets. Moreover, we observe no substantial improvement when evaluating models in a few-shot setting. Finally, in contrast to previous work, we find that leveraging explicit commonsense knowledge does not yield a substantial improvement.