Language models (LMs) trained on large amounts of data have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup. Here we aim to better understand the extent to which such models learn commonsense knowledge -- a critical component of many NLP applications. To that end, we conduct a systematic and rigorous zero-shot and few-shot commonsense evaluation of pre-trained LMs, where we: (i) carefully control for the LM's ability to exploit potential surface cues and annotation artefacts, and (ii) account for variations in model performance that arise from non-commonsense related factors. Our findings highlight the limitations of pre-trained LMs in acquiring commonsense knowledge without task-specific supervision; furthermore, using larger models -- or augmenting the LMs with commonsense knowledge bases at test-time -- did not substantially improve their performance. More broadly, our findings offer valuable lessons and best practices for conducting more rigorous multiple-choice evaluations of pre-trained LMs.