Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training on downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that Learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, Learners fine-tune 20% faster and have significantly lower resource utilization.
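To make "fine-tuning a subset of parameters" concrete, the short sketch below freezes a pre-trained DistilBERT backbone and trains only the small classification head. This is a generic parameter-efficient fine-tuning illustration, not the Learner modules or priming described above; the checkpoint name, label count, and optimizer settings are illustrative assumptions.

# Minimal sketch of parameter-efficient fine-tuning: freeze the pre-trained
# backbone and train only a small subset of parameters (here, the task head).
# This illustrates the general idea only; it is NOT the paper's Learner module.
import torch
from transformers import AutoModelForSequenceClassification

# Assumed checkpoint and label count, chosen for illustration.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Freeze every backbone parameter so gradients flow only through the head.
for param in model.distilbert.parameters():
    param.requires_grad = False

# Hand only the unfrozen parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)

print(f"Trainable parameters: {sum(p.numel() for p in trainable):,} "
      f"of {sum(p.numel() for p in model.parameters()):,}")

Only the head's gradients are computed and stored, which is what reduces the memory and compute cost of fine-tuning relative to updating all backbone weights.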