Pre-trained language models (PLMs) achieve remarkable performance on many downstream tasks, but they may fail to give reliable estimates of their predictive uncertainty. Given the lack of a comprehensive understanding of PLM calibration, we take a close look at this new research problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated during training? (2) How effective are existing calibration methods? For the first question, we conduct fine-grained control experiments to study the dynamic change in PLMs' calibration performance during training. We consider six factors as control variables: dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pretraining. In our experiments, we observe a consistent change in calibration performance across all six factors. We find that PLMs do not learn to become calibrated during training, as evidenced by their continually increasing confidence regardless of whether their predictions are correct. We highlight that this finding partly contradicts two established conclusions: (a) larger PLMs are better calibrated; (b) pretraining improves model calibration. Next, we study how effectively existing calibration methods mitigate the overconfidence issue, in both in-distribution and various out-of-distribution settings. Besides unlearnable calibration methods, we adapt two recently proposed learnable methods that directly collect data to train models to produce reasonable confidence estimates. We also propose extensions of these learnable methods that further improve or maintain PLM calibration without sacrificing original task performance. Experimental results show that learnable methods significantly reduce PLMs' confidence in wrong predictions, and that our methods outperform previous ones.
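Calibration performance of the kind studied here is commonly quantified with the expected calibration error (ECE): predictions are grouped into equal-width confidence bins, and the gap between average confidence and accuracy is averaged across bins, weighted by bin size. A minimal sketch (the function name and equal-width binning scheme are our own illustration, not the paper's exact setup):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy
    over equal-width confidence bins. `confidences` are the model's
    top-class probabilities; `correct` is 1/0 per prediction."""
    bin_width = 1.0 / n_bins
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b * bin_width, (b + 1) * bin_width
        # Bin b covers (lo, hi]; put confidence 0.0 into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            conf = sum(confidences[i] for i in idx) / len(idx)
            ece += (len(idx) / total) * abs(acc - conf)
    return ece

# A model that is confident and always right is perfectly calibrated:
print(expected_calibration_error([1.0, 1.0], [1, 1]))  # → 0.0
```

An overconfident model in the paper's sense is one whose confidences sit well above its accuracy in each bin, which drives this metric up even when task accuracy is high.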