Despite this success, the process of fine-tuning large-scale PLMs brings prohibitive adaptation costs. In fact, fine-tuning all the parameters of a colossal model and retaining separate instances for different tasks are practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, dubbed delta tuning in this paper. In contrast with standard fine-tuning, delta tuning only fine-tunes a small portion of the model parameters while keeping the rest untouched, substantially reducing both computation and storage costs. Recent studies have demonstrated that a series of delta tuning methods with distinct choices of tuned parameters can achieve performance on a par with full-parameter fine-tuning, suggesting a promising new way of stimulating large-scale PLMs. In this paper, we first formally describe the problem of delta tuning and then comprehensively review recent delta tuning approaches. We also propose a unified categorization criterion that divides existing delta tuning methods into three groups: addition-based, specification-based, and reparameterization-based methods. Though initially proposed as an efficient method to steer large models, we believe that some of the fascinating evidence discovered along with delta tuning could help further reveal the mechanisms of PLMs and even deep neural networks. To this end, we discuss the theoretical principles underlying the effectiveness of delta tuning and propose frameworks to interpret delta tuning from the perspectives of optimization and optimal control, respectively. Furthermore, we provide a holistic empirical study of representative methods, where results on over 100 NLP tasks provide a comprehensive performance comparison of different approaches. The experimental results also cover the analysis of the combinatorial, scaling, and transferability properties of delta tuning.
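To make the core idea concrete, below is a minimal PyTorch sketch of one specification-based delta tuning method (bias-only tuning in the spirit of BitFit): all weights are frozen and only the bias vectors remain trainable, so the optimizer updates a tiny fraction of the parameters. The toy two-layer module, layer sizes, and learning rate are illustrative placeholders, not details drawn from the paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for a PLM; a real setup would load a pretrained transformer.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.ReLU(),
    nn.Linear(3072, 768),
)

# Specification-based delta tuning: freeze everything except bias terms.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")

# Hand only the small trainable subset to the optimizer; the frozen
# weights receive no gradients and stay untouched.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

Addition-based methods (e.g., adapters) would instead insert new trainable modules, and reparameterization-based methods (e.g., low-rank updates) would express the weight change in a compact form; the freeze-then-select pattern above is common to all three families.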