Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, the out-of-distribution (OOD) generalization problem remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark, named GLUE-X, for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights into how to measure and improve the robustness of a model. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 19 widely used PLMs. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation relative to in-distribution (ID) accuracy was observed in all settings.