The recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of an LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, we propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning) and a comparison of the learning curve of a fine-tuned LM against the learning curves of multiple controls, which paints a rich picture of LM capabilities. Our main findings are that: (a) different LMs exhibit qualitatively different reasoning abilities, e.g., RoBERTa succeeds on reasoning tasks where BERT fails completely; (b) LMs do not reason in an abstract manner and are context-dependent, e.g., while RoBERTa can compare ages, it can do so only when the ages are in the typical range of human ages; (c) on half of our reasoning tasks, all models fail completely. Our findings and infrastructure can help future work on designing new datasets, models, and objective functions for pre-training.
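To make the zero-shot part of the protocol concrete, below is a minimal sketch of probing a masked LM on an age-comparison task without any fine-tuning. The model name, the cloze template, and the candidate answer words are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: zero-shot probing of a masked LM on age comparison.
# Assumptions (not from the paper): roberta-base, this cloze template,
# and the candidate words "older"/"younger".
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def compare_ages(age_a: int, age_b: int) -> str:
    """Ask the LM which person is older via a single-mask cloze query."""
    text = (f"A {age_a} year old person is {tokenizer.mask_token} than a "
            f"{age_b} year old person.")
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Score only the two candidate fillers, as in cloze-style probing.
    candidates = ["older", "younger"]
    cand_ids = [tokenizer(" " + w, add_special_tokens=False).input_ids[0]
                for w in candidates]
    return candidates[int(torch.argmax(logits[cand_ids]))]

# e.g., compare_ages(41, 24) should return "older" if the model can compare;
# repeating with atypical ages (e.g., 141 vs. 124) tests context-dependence.
```

Because no parameters are updated, any success on such a probe can be attributed to the pre-trained representations rather than to task-specific fine-tuning, which is exactly the attribution question the protocol is designed to separate.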