A wide range of NLP tasks benefit from fine-tuning pretrained language models (PLMs). However, directly fine-tuned models often contain a number of redundant parameters that contribute little to the downstream task. We argue that the gap between pretraining and downstream tasks hinders the training of these redundant parameters and leads to suboptimal performance of the overall model. In this paper, we present PATS (Perturbation According To Sensitivity), a noisy training mechanism that takes each parameter's importance to the downstream task into account when fine-tuning PLMs. The main idea of PATS is to add larger noise to parameters with lower sensitivity, and vice versa, so as to activate the contributions of more parameters to downstream tasks without much affecting the sensitive ones. Extensive experiments on different tasks of the GLUE benchmark show that PATS consistently improves the fine-tuning of PLMs of different sizes, and that the parameters of the best-performing models always have more concentrated sensitivity distributions, which empirically demonstrates the effectiveness of our method.
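To make the core idea concrete, the sketch below illustrates sensitivity-scaled noise injection in PyTorch. It is a minimal illustration, not the paper's implementation: the function name `add_sensitivity_scaled_noise`, the hyperparameter `noise_scale`, the use of the first-order proxy |θ · ∇L| for sensitivity, and the per-tensor normalization are all assumptions made for the example; the exact sensitivity measure and noise schedule in PATS may differ.

```python
import torch


def add_sensitivity_scaled_noise(model, noise_scale=1e-4, eps=1e-12):
    """Illustrative sketch: perturb parameters with Gaussian noise whose
    magnitude shrinks as the parameter's estimated sensitivity grows.

    Assumes loss.backward() has already populated p.grad. Sensitivity is
    approximated by |theta * dL/dtheta|, a common first-order importance
    proxy (an assumption for this sketch, not necessarily the PATS formula).
    """
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # First-order sensitivity proxy: |theta * dL/dtheta|
            sensitivity = (p * p.grad).abs()
            # Normalize within the tensor so scaling is relative per layer
            norm_sens = sensitivity / (sensitivity.max() + eps)
            # Lower sensitivity -> larger noise; higher sensitivity -> smaller noise
            std = noise_scale * (1.0 - norm_sens)
            p.add_(torch.randn_like(p) * std)


# Assumed placement inside a training step:
#   loss.backward()
#   add_sensitivity_scaled_noise(model)
#   optimizer.step()
```

In this sketch the perturbation is applied between the backward pass and the optimizer step, so insensitive parameters are nudged away from their pretrained values (encouraging them to adapt to the downstream task) while sensitive parameters are left largely intact.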