Nowadays, pretrained language models (PLMs) dominate the majority of NLP tasks. However, little research has been conducted on systematically evaluating the language abilities of PLMs. In this paper, we present a large-scale empirical study on the general language ability evaluation of PLMs (ElitePLM). In our study, we design four evaluation dimensions, i.e., memory, comprehension, reasoning, and composition, to measure ten widely used PLMs within five categories. Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs on downstream tasks is usually sensitive to the data size and distribution; (3) PLMs have excellent transferability between similar tasks. Moreover, the prediction results of PLMs in our experiments are released as an open resource for deeper and more detailed analysis of the language abilities of PLMs. This paper can guide future work in selecting, applying, and designing PLMs for specific tasks. We have made all the details of our experiments publicly available at https://github.com/RUCAIBox/ElitePLM.