It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model for studying the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes the fine-tuning of pre-trained LMs. This study was inspired by the decent performance of the NTK on computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize the conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
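For intuition, the kernel in question is the empirical NTK evaluated at the pre-trained parameters, $K(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle$, where $f$ is a scalar model output (e.g., the logit of the verbalizer token in a prompt-based formulation). The sketch below, a minimal JAX illustration rather than the paper's actual setup (the toy two-layer model `f` and its dimensions are purely hypothetical), shows how one entry of this kernel can be computed; it covers the standard SGD kernel only, whereas the paper also extends the formalism to Adam.

```python
import jax
import jax.numpy as jnp

def f(params, x):
    """Scalar model output standing in for, e.g., the masked-token logit."""
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2

def empirical_ntk(params, x1, x2):
    """K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)> at the given parameters."""
    g1 = jax.grad(f)(params, x1)  # gradient w.r.t. all parameters at x1
    g2 = jax.grad(f)(params, x2)  # gradient w.r.t. all parameters at x2
    return sum(jnp.vdot(a, b)
               for a, b in zip(jax.tree_util.tree_leaves(g1),
                               jax.tree_util.tree_leaves(g2)))

# Toy usage with random "pre-trained" parameters and two inputs.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = (jax.random.normal(k1, (16, 32)) / jnp.sqrt(16.0),
          jax.random.normal(k2, (32,)) / jnp.sqrt(32.0))
x1 = jax.random.normal(k3, (16,))
x2 = jax.random.normal(k4, (16,))
print(empirical_ntk(params, x1, x2))
```

If fine-tuning stays in the kernel regime, training reduces (approximately) to kernel regression with this Gram matrix over the downstream training set, which is one way to see why a handful of examples need not cause overfitting despite the parameter count.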