Gigantic pre-trained models have become central to natural language processing (NLP), serving as the starting point for fine-tuning towards a range of downstream tasks. However, two pain points persist for this paradigm: (a) as the pre-trained models grow bigger (e.g., 175B parameters for GPT-3), even the fine-tuning process can be time-consuming and computationally expensive; (b) the fine-tuned model has the same size as its starting point by default, which is neither sensible given its more specialized functionality, nor practical since many fine-tuned models will be deployed in resource-constrained environments. To address these pain points, we propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter-efficient fine-tuning, by enforcing sparsity-aware weight updates on top of the pre-trained weights; and (ii) resource-efficient inference, by encouraging a sparse weight structure in the final fine-tuned model. We leverage sparsity in these two directions by exploiting both unstructured and structured sparse patterns in pre-trained language models via magnitude-based pruning and $\ell_1$ sparse regularization. Extensive experiments and in-depth investigations, with diverse network backbones (i.e., BERT, GPT-2, and DeBERTa) on dozens of datasets, consistently demonstrate highly impressive parameter-/training-/inference-efficiency, while maintaining competitive downstream transfer performance. For instance, our DSEE-BERT obtains about $35\%$ inference FLOPs savings with <1% trainable parameters and comparable performance to conventional fine-tuning. Code is available at https://github.com/VITA-Group/DSEE.
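The sparsity-aware weight updates mentioned above can be illustrated with a minimal sketch of magnitude-based pruning applied to a fine-tuning update matrix. This is an illustrative example, not the authors' implementation; the function name, the 99% sparsity level, and the use of NumPy are assumptions for demonstration only.

```python
import numpy as np

def magnitude_prune(delta, sparsity=0.99):
    """Zero out all but the largest-magnitude entries of a weight update.

    Keeps the top (1 - sparsity) fraction of entries by absolute value,
    mimicking an unstructured magnitude-based pruning step.
    """
    k = int(delta.size * (1 - sparsity))  # number of entries to keep
    # k-th largest absolute value serves as the pruning threshold
    thresh = np.sort(np.abs(delta).ravel())[-k]
    mask = np.abs(delta) >= thresh
    return delta * mask

# Example: prune a dense fine-tuning update for a 768x768 weight matrix
# (768 is a typical BERT-base hidden dimension).
rng = np.random.default_rng(0)
delta = rng.normal(size=(768, 768))
sparse_delta = magnitude_prune(delta, sparsity=0.99)
density = np.count_nonzero(sparse_delta) / sparse_delta.size
```

Under this sketch, only about 1% of the update entries survive, so the update can be stored and applied sparsely on top of the frozen pre-trained weights.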