Fine-tuning pre-trained models has proven effective across a wide range of NLP tasks. However, fine-tuning the whole model is parameter-inefficient, as it yields an entirely new model for each task. Recently, many research works have proposed to fine-tune only a small portion of the parameters while keeping most of the parameters shared across different tasks. These methods achieve surprisingly good performance and are shown to be more stable than their fully fine-tuned counterparts. However, such methods are still not well understood. Some natural questions arise: How does parameter sparsity lead to promising performance? Why is the model more stable than fully fine-tuned models? How should the tunable parameters be chosen? In this paper, we first categorize the existing methods into random approaches, rule-based approaches, and projection-based approaches according to how they choose which parameters to tune. We then show that all of these methods are in fact sparse fine-tuned models and conduct a novel theoretical analysis of them. We show that the sparsity actually imposes a regularization on the original model by controlling the upper bound of its stability, and that such stability leads to the better generalization that has been empirically observed in many recent works. Although our theory grounds the effectiveness of sparsity, how to choose the tunable parameters remains an open problem. To better choose the tunable parameters, we propose a novel Second-order Approximation Method (SAM), which approximates the original problem with an analytically solvable optimization function; the tunable parameters are then determined by directly optimizing this approximation. Experimental results show that our proposed SAM outperforms many strong baseline models and also verifies our theoretical analysis.