Fine-tuning pre-trained language models on downstream tasks with different random seeds has been shown to be unstable, especially on small datasets. Many previous studies have investigated this instability and proposed methods to mitigate it. However, most studies used only the standard deviation of performance scores (SD) as their measure, which is a narrow characterization of instability. In this paper, we analyze SD and six other measures that quantify instability at different levels of granularity. Moreover, we propose a systematic framework to evaluate the validity of these measures. Finally, we analyze the consistency of and differences between these measures by reassessing existing instability mitigation methods. We hope our results will inform the development of better measurements of fine-tuning instability.