Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods have been proposed to learn incremental updates of pre-trained weights in a parameter-efficient way, e.g., low-rank increments. These methods often distribute the budget of incremental updates evenly across all pre-trained weight matrices, overlooking the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance scores. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. This novel approach allows us to effectively prune the singular values of unimportant updates, which reduces their parameter budget while circumventing intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA delivers notable improvements over baselines, especially in low-budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA .
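To make the SVD-form parameterization concrete, the following is a minimal sketch (not the authors' implementation; all names, sizes, and values are hypothetical) of representing an incremental update as ΔW = PΛQ and pruning small "singular values" to shrink the effective rank budget without computing an exact SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 4  # hypothetical layer dimensions and initial rank budget

# SVD-style parameterization of the incremental update: Delta W = P @ diag(lam) @ Q.
# P and Q play the role of left/right singular vectors (learned, not orthogonalized
# here), and lam holds the learned singular values.
P = rng.standard_normal((d_out, r))
Q = rng.standard_normal((r, d_in))
lam = np.array([1.5, 0.9, 0.05, 0.01])  # toy singular values / importance scores

def delta_w(P, lam, Q, keep_mask):
    """Incremental update with unimportant triplets pruned (their lam zeroed)."""
    return P @ np.diag(lam * keep_mask) @ Q

# Prune the two smallest singular values: the effective rank (and hence the
# parameter budget spent on this matrix) drops from 4 to 2, with no exact SVD.
keep = (lam >= 0.1).astype(float)
dW = delta_w(P, lam, Q, keep)
print(np.linalg.matrix_rank(dW))
```

Zeroing a singular value removes one rank-1 component from the update, which is how budget can be reallocated toward more important weight matrices.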