Pre-trained language models (PLMs) have gained increasing popularity due to their compelling prediction performance in diverse natural language processing (NLP) tasks. When formulating a PLM-based prediction pipeline for NLP tasks, it is also crucial for the pipeline to minimize calibration error, especially in safety-critical applications: the pipeline should reliably indicate when its predictions can be trusted. Composing such a pipeline involves several design choices: (1) the choice of PLM, (2) the size of the PLM, (3) the choice of uncertainty quantifier, and (4) the choice of fine-tuning loss, among others. Although prior work has examined some of these choices, it typically draws conclusions from a limited scope of empirical studies; a holistic analysis of how to compose a well-calibrated PLM-based prediction pipeline is still lacking. To fill this void, we compare a wide range of popular options for each design choice on three prevalent NLP classification tasks as well as under domain shift. Based on this analysis, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs when possible, (3) use temperature scaling (Temp Scaling) as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning.
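To make the last two recommendations concrete, the snippet below is a minimal, self-contained sketch (not code from the paper) of the two techniques the abstract names: Focal Loss as a fine-tuning objective and temperature scaling as a post-hoc uncertainty quantifier. The helper names (`FocalLoss`, `fit_temperature`) and hyperparameters (e.g. `gamma=2.0`) are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch of Focal Loss fine-tuning and post-hoc temperature
# scaling. Names and hyperparameters are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Focal loss (Lin et al., 2017): down-weights well-classified examples
    by a factor (1 - p_t)^gamma so training focuses on hard examples,
    which tends to reduce over-confidence relative to cross-entropy."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        log_probs = F.log_softmax(logits, dim=-1)
        log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        return (-((1.0 - pt) ** self.gamma) * log_pt).mean()


def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Temperature scaling (Guo et al., 2017): learn a single scalar T > 0
    on held-out validation logits by minimizing the NLL of
    softmax(logits / T). Predicted labels are unchanged; only the
    confidence scores are rescaled."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().detach()


if __name__ == "__main__":
    # Toy example with random, deliberately over-confident "validation" logits.
    torch.manual_seed(0)
    logits = torch.randn(256, 3) * 3.0
    labels = torch.randint(0, 3, (256,))
    T = fit_temperature(logits, labels)
    print(f"fitted temperature: {T.item():.3f}")  # T > 1 softens confidences
    calibrated_probs = F.softmax(logits / T, dim=-1)
```

In a PLM pipeline, `FocalLoss` would replace cross-entropy during fine-tuning, while `fit_temperature` would be applied afterwards to the classifier's validation-set logits; because the scalar T preserves the argmax, accuracy is unaffected and only calibration changes.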