Language models (LMs) trained on large amounts of data have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup. Here we aim to better understand the extent to which such models learn commonsense knowledge -- a critical component of many NLP applications. We conduct a systematic and rigorous zero-shot and few-shot commonsense evaluation of large pre-trained LMs, where we: (i) carefully control for the LMs' ability to exploit potential surface cues and annotation artefacts, and (ii) account for variations in performance that arise from factors unrelated to commonsense knowledge. Our findings highlight the limitations of pre-trained LMs in acquiring commonsense knowledge without task-specific supervision; furthermore, neither using larger models nor few-shot evaluation is sufficient to achieve human-level commonsense performance.