Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters. This not only increases the serving cost of storing a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation. Parameter-efficient techniques have been developed that tune small trainable components (e.g., adapters) injected into the large model while keeping most of the model weights frozen. The prevalent mechanism for increasing adapter capacity is to enlarge the bottleneck dimension, which also increases the number of adapter parameters. In this work, we introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost, built on two key techniques. (i) We introduce multiple shared adapter components in each layer of the Transformer architecture. We leverage sparse learning via random routing to update the adapter parameters (the encoder is kept frozen), resulting in the same computational cost (FLOPs) as training a single adapter. (ii) We propose a simple merging mechanism that averages the weights of the multiple adapter components to collapse them into a single adapter in each Transformer layer, thereby keeping the overall parameter count the same while yielding significant performance improvement. We demonstrate that these techniques work well across multiple task settings, including fully supervised and few-shot Natural Language Understanding tasks. By tuning only 0.23% of a pre-trained language model's parameters, our model outperforms full model fine-tuning and several competing methods.
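To make the two mechanisms concrete, the sketch below illustrates (i) random routing over multiple bottleneck adapters during training and (ii) weight averaging to collapse them into a single adapter for serving. This is a minimal PyTorch illustration under standard assumptions about bottleneck adapters; the class names, the number of adapters, and the merge() helper are illustrative and not taken from the paper's code.

```python
# Minimal sketch, assuming a standard bottleneck adapter design
# (down-projection -> nonlinearity -> up-projection with a residual).
import copy
import random
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: d_model -> bottleneck -> d_model, with residual."""

    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen Transformer output intact.
        return x + self.up(self.act(self.down(x)))


class MixtureOfAdapters(nn.Module):
    """Holds several adapters per layer; each training step randomly routes
    to one of them, so per-step FLOPs match training a single adapter."""

    def __init__(self, d_model: int, bottleneck: int, num_adapters: int = 4):
        super().__init__()
        self.adapters = nn.ModuleList(
            Adapter(d_model, bottleneck) for _ in range(num_adapters)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Stochastic routing: pick one adapter per forward pass.
            idx = random.randrange(len(self.adapters))
            return self.adapters[idx](x)
        # At inference, use the merged adapter (in practice, merge once
        # before deployment rather than on every call).
        return self.merge()(x)

    def merge(self) -> Adapter:
        """Collapse the mixture into a single adapter by averaging weights,
        so the served parameter count equals that of one adapter."""
        merged = copy.deepcopy(self.adapters[0])
        with torch.no_grad():
            for name, param in merged.named_parameters():
                stacked = torch.stack(
                    [dict(a.named_parameters())[name] for a in self.adapters]
                )
                param.copy_(stacked.mean(dim=0))
        return merged
```

In this reading, only the adapter parameters receive gradients while the surrounding Transformer weights stay frozen, and the averaged adapter is the one that is actually stored and served, which is why neither the training FLOPs nor the final parameter count grows with the number of adapter components.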