Fine-tuning large pre-trained models on downstream tasks has recently been adopted in a variety of domains. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating only a small subset of parameters (e.g., 2% of the parameters) inside a pre-trained backbone network for a new task, they reduce the training memory requirement by at most 30%. This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by more substantial amounts. Unlike existing parameter-efficient methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations from the backbone network as input via shortcut connections (called ladders) and makes predictions. LST has significantly lower memory requirements than previous methods because it does not require backpropagation through the backbone network, but only through the side network and ladder connections. We evaluate our method with various models (T5 and CLIP-T5) on both NLP (GLUE) and vision-and-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory cost of fine-tuning the whole network, while other methods save only 26% with similar parameter usage (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in a low-memory regime. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models, attaining better GLUE performance than full fine-tuning and other PETL methods. The same accuracy-efficiency trade-off also holds on VL tasks.
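To make the mechanism described above concrete, below is a minimal PyTorch-style sketch (not the authors' implementation) of a ladder side network: the backbone runs frozen, its intermediate activations are detached and projected by small ladder layers, and gradients flow only through the side network and ladders. Names such as `LadderSideNetwork`, `side_dim`, and the gating scheme are illustrative assumptions.

```python
# Minimal sketch of the LST idea, assuming a frozen Transformer backbone that
# exposes its per-layer hidden states. Module names and dimensions are hypothetical.
import torch
import torch.nn as nn


class LadderSideNetwork(nn.Module):
    """A small side network that consumes projected backbone activations."""

    def __init__(self, backbone_dim=768, side_dim=96, num_layers=12, num_classes=2):
        super().__init__()
        # Ladder projections map backbone activations down to the side width.
        self.ladders = nn.ModuleList(
            nn.Linear(backbone_dim, side_dim) for _ in range(num_layers)
        )
        # Lightweight side blocks; a simple MLP block stands in for the reduced
        # Transformer blocks used in the paper.
        self.side_blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(side_dim), nn.Linear(side_dim, side_dim), nn.GELU())
            for _ in range(num_layers)
        )
        # Learned gates that mix each ladder input with the running side state.
        self.gates = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, backbone_hidden_states):
        # backbone_hidden_states: list of [batch, seq, backbone_dim] tensors.
        h = None
        for i, (ladder, block) in enumerate(zip(self.ladders, self.side_blocks)):
            # Detach so backpropagation stops at the ladder: the backbone never
            # needs to store activations for gradients, which is the source of
            # the memory savings.
            shortcut = ladder(backbone_hidden_states[i].detach())
            g = torch.sigmoid(self.gates[i])
            h = shortcut if h is None else g * h + (1.0 - g) * shortcut
            h = h + block(h)
        # Pool over the sequence dimension and classify.
        return self.head(h.mean(dim=1))
```

A usage sketch, assuming `backbone(...)` returns a list of per-layer hidden states: run the frozen backbone under `torch.no_grad()`, pass the collected activations to `LadderSideNetwork`, and update only the side network's parameters with the optimizer.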