State-of-the-art pretrained NLP models contain from hundreds of millions to trillions of parameters. Adapters provide a parameter-efficient alternative to full finetuning, in which only lightweight neural network layers are finetuned on top of the pretrained weights. Adapter layers are initialized randomly. However, existing work uses the same adapter architecture -- i.e., the same adapter layer on top of each layer of the pretrained model -- for every dataset, regardless of the properties of the dataset or the amount of available training data. In this work, we introduce adaptable adapters, which contain (1) learnable activation functions that can differ across layers and across input data, and (2) a learnable switch that selects and uses only the beneficial adapter layers. We show that adaptable adapters perform on par with the standard adapter architecture while using a considerably smaller number of adapter layers. In addition, we show that the adapter architecture selected by adaptable adapters transfers well across different data settings and similar tasks. We propose to use adaptable adapters for designing efficient and effective adapter architectures. The resulting adapters (a) contain about 50% of the learnable parameters of the standard adapter, and are therefore more efficient at training and inference and require less storage space, and (b) achieve considerably higher performance in low-data settings.
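To make the two components concrete, the sketch below shows one possible adaptable adapter layer in PyTorch: a standard bottleneck adapter whose activation is a learnable rational function and whose use is governed by a learnable binary switch relaxed with logistic (Gumbel-style) noise during training. This is a minimal illustration under our own assumptions, not the paper's implementation; the class names `RationalActivation` and `AdaptableAdapterLayer`, the polynomial degrees, and the exact form of the switch are hypothetical.

```python
import torch
import torch.nn as nn


class RationalActivation(nn.Module):
    """Illustrative learnable activation: a ratio of two polynomials whose
    coefficients are trained jointly with the adapter weights (assumption)."""

    def __init__(self, degree_p=5, degree_q=4):
        super().__init__()
        self.p = nn.Parameter(torch.randn(degree_p + 1) * 0.1)
        self.q = nn.Parameter(torch.randn(degree_q) * 0.1)

    def forward(self, x):
        # Numerator: sum_j p_j * x^j
        powers_p = torch.stack([x ** j for j in range(self.p.numel())], dim=-1)
        numerator = (powers_p * self.p).sum(-1)
        # Denominator kept positive for numerical stability: 1 + |sum_k q_k * x^k|
        powers_q = torch.stack([x ** (k + 1) for k in range(self.q.numel())], dim=-1)
        denominator = 1.0 + (powers_q * self.q).sum(-1).abs()
        return numerator / denominator


class AdaptableAdapterLayer(nn.Module):
    """Bottleneck adapter with a learnable activation and a learnable switch
    that decides whether this adapter layer is used at all (sketch)."""

    def __init__(self, hidden_size, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = RationalActivation()
        # Scalar logit of a binary switch; trained with a relaxed sample.
        self.switch_logit = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states, temperature=1.0):
        adapter_out = self.up(self.activation(self.down(hidden_states)))
        if self.training:
            # Relaxed Bernoulli (binary concrete) sample keeps the switch differentiable.
            u = torch.rand_like(self.switch_logit)
            logistic_noise = torch.log(u) - torch.log1p(-u)
            gate = torch.sigmoid((self.switch_logit + logistic_noise) / temperature)
        else:
            # At inference the switch is hard: a closed gate skips the adapter entirely.
            gate = (self.switch_logit > 0).float()
        return hidden_states + gate * adapter_out
```

In such a sketch, the switch logits are trained together with the adapter weights; after training, layers whose switches are off can simply be dropped, which is how an adapter built from the selected architecture would end up with roughly half the learnable parameters of the standard design.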