In the age of large transformer language models, linguistic evaluation plays an important role in diagnosing models' abilities and limitations in natural language understanding. However, current evaluation methods have significant shortcomings. In particular, they do not provide insight into how well a language model captures the distinct linguistic skills essential for language understanding and reasoning. They therefore fail to effectively map out the aspects of language understanding that remain challenging for existing models, which makes it difficult to discover potential limitations in models and datasets. In this paper, we introduce Curriculum, a new format of NLI benchmark for the evaluation of broad-coverage linguistic phenomena. Curriculum contains a collection of datasets covering 36 major types of linguistic phenomena, together with an evaluation procedure for diagnosing how well a language model captures the reasoning skills required for each type. We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality. In addition, our experiments provide insight into the limitations of existing benchmark datasets and state-of-the-art models, which may encourage future research on re-designing datasets, model architectures, and learning objectives.