Pre-trained language models (PLMs) improve model generalization by leveraging massive corpora during the pre-training phase. However, out-of-distribution (OOD) generalization remains a largely unsolved problem, even for large-scale PLMs on natural language understanding tasks, which hinders the deployment of NLP methods in the real world. To facilitate research in this direction, this paper makes the first attempt to establish a unified benchmark named GLUE-X, highlighting the importance of OOD robustness and providing insights into how to measure the robustness of a model and how to improve it. To this end, we collect 13 publicly available datasets as OOD test data and conduct evaluations on 8 classic NLP tasks over \emph{18} popularly used models. Our findings confirm that OOD accuracy in NLP tasks deserves greater attention, since significant performance decay relative to ID accuracy is observed in all settings.