While pre-trained language models (PLMs) have become the de facto standard for improving the accuracy of text classification, recent studies find that PLMs often make over-confident predictions. Although various calibration methods have been proposed, such as ensemble learning and data augmentation, most have been validated on computer vision benchmarks rather than on PLM-based text classification tasks. In this paper, we present an empirical study on confidence calibration for PLMs, covering three categories of techniques: confidence penalty losses, data augmentation, and ensemble methods. We find that ensemble models overfitted to the training set show sub-par calibration performance, and we also observe that PLMs trained with a confidence penalty loss exhibit a trade-off between calibration and accuracy. Building on these observations, we propose the Calibrated PLM (CALL), a combination of calibration techniques. CALL mitigates the drawbacks that arise when each calibration method is applied individually and improves both classification and calibration accuracy. We extensively study the design choices in CALL's training procedure and provide a detailed analysis of how calibration techniques affect the calibration performance of PLMs.
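To make the confidence penalty loss mentioned above concrete, the following is a minimal sketch of one standard formulation, cross-entropy minus a weighted entropy term that discourages low-entropy (over-confident) output distributions. This is an illustrative assumption, not the paper's exact CALL implementation; the function name and the weight `beta` are hypothetical.

```python
import torch
import torch.nn.functional as F

def confidence_penalized_loss(logits: torch.Tensor,
                              labels: torch.Tensor,
                              beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy with an entropy-based confidence penalty.

    Subtracting beta * entropy penalizes over-confident (peaked) predictive
    distributions, which typically improves calibration at some cost in accuracy.
    """
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
    return ce - beta * entropy
```

In practice, `beta` controls the calibration/accuracy trade-off noted in the abstract: larger values flatten the predictive distribution and lower confidence, while `beta = 0` recovers plain cross-entropy training.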