Fine-tuning large pre-trained models on downstream tasks has recently been adopted in a variety of domains. However, it is costly to update the entire parameter set of large pre-trained models. Although recently proposed parameter-efficient transfer learning (PETL) techniques allow updating only a small subset of parameters (e.g., 2% of the parameters) inside a pre-trained backbone network for a new task, they reduce the training memory requirement by at most 30%. This is because the gradient computation for the trainable parameters still requires backpropagation through the large pre-trained backbone model. To address this, we propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by a substantially larger amount. Unlike existing parameter-efficient methods that insert additional parameters inside the backbone network, we train a ladder side network: a small, separate network that takes intermediate activations as input via shortcut connections (ladders) from the backbone network and makes predictions. LST has significantly lower memory requirements than previous methods because it does not require backpropagation through the backbone network, but only through the side network and the ladder connections. We evaluate our method with various models (T5, CLIP-T5) on both NLP (GLUE) and vision-language (VQA, GQA, NLVR2, MSCOCO) tasks. LST saves 69% of the memory cost of fine-tuning the whole network, whereas other methods save only 26% with similar parameter usage (hence, 2.7x more memory savings). Moreover, LST achieves higher accuracy than Adapter and LoRA in the low-memory regime. To further demonstrate the advantage of this memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods. The same trend also holds in our experiments on VL tasks.
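The following is a minimal PyTorch sketch of the ladder-side-tuning idea described above: a frozen backbone is run without building a computation graph, its intermediate activations are fed through small downsamplers into a lightweight side network with learned gates, and only the side path is trained. The toy backbone blocks, the reduction factor, the sigmoid gating, and all names here (ToyBackboneBlock, LadderSideNetwork, etc.) are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of ladder side-tuning (LST), assuming a toy transformer-like backbone.
import torch
import torch.nn as nn

class ToyBackboneBlock(nn.Module):
    """Stand-in for one frozen block of a large pre-trained backbone."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class LadderSideNetwork(nn.Module):
    """Small side network fed by detached backbone activations ("ladders")."""
    def __init__(self, backbone_blocks, dim, reduction=8, num_classes=2):
        super().__init__()
        self.backbone_blocks = backbone_blocks
        side_dim = dim // reduction          # side network is much narrower
        n = len(backbone_blocks)
        # Downsample each backbone activation to the small side dimension.
        self.downsamplers = nn.ModuleList(nn.Linear(dim, side_dim) for _ in range(n + 1))
        # Lightweight side blocks (far smaller than the backbone blocks).
        self.side_blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(side_dim, side_dim), nn.ReLU()) for _ in range(n)
        )
        # Learned gates mixing each ladder input with the running side state.
        self.gates = nn.Parameter(torch.zeros(n))
        self.head = nn.Linear(side_dim, num_classes)
        # Freeze the backbone: only side-network parameters receive gradients.
        for p in self.backbone_blocks.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        # Run the backbone without building a graph; we never backprop through it.
        with torch.no_grad():
            acts = [x]
            h = x
            for blk in self.backbone_blocks:
                h = blk(h)
                acts.append(h)
        # Side path: start from the downsampled input activation.
        s = self.downsamplers[0](acts[0])
        for i, blk in enumerate(self.side_blocks):
            g = torch.sigmoid(self.gates[i])
            ladder = self.downsamplers[i + 1](acts[i + 1])
            s = blk(g * ladder + (1 - g) * s)
        return self.head(s.mean(dim=1))      # pool over sequence, then classify

# Usage: only the side network's parameters are passed to the optimizer.
dim = 64
backbone = nn.ModuleList(ToyBackboneBlock(dim) for _ in range(4))
model = LadderSideNetwork(backbone, dim)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4
)
logits = model(torch.randn(2, 10, dim))      # (batch, seq_len, dim)
loss = logits.sum()                          # placeholder loss for the sketch
loss.backward()                              # gradients flow only through side blocks, gates, head
```

Because the backbone runs under no_grad and its activations are consumed as constants, backpropagation only traverses the narrow side path, which is the source of the memory savings this sketch is meant to illustrate.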