When transferring a pretrained language model, common approaches conventionally attach the task-specific classifier to the top layer and adapt all of the pretrained layers. We investigate whether one can make a task-specific selection of which subset of layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g., fine-tuning or adapter-tuning) without sacrificing performance. We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already "well-specialized" in a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute, requires no training or hyperparameter tuning, and is robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric yields significantly stronger performance than using the same number of top layers, and often matches the performance of fine-tuning or adapter-tuning the entire language model.
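To make the layer score concrete, below is a minimal sketch of one plausible way to compute such a variability ratio, assuming (hypothetically) that within-class variability is measured as the average squared distance of a layer's hidden states to their class mean, and between-class variability as the spread of class means around the global mean; the exact metric used in the paper may differ. The function name `variability_ratio` and the pooling choice are illustrative, not taken from the source.

```python
import numpy as np

def variability_ratio(hidden_states, labels):
    """Within-class vs. between-class variability of one layer's hidden states.

    hidden_states: (n_examples, hidden_dim) array of a layer's pooled
                   representations for a task-specific corpus.
    labels:        (n_examples,) array of class labels.

    Lower values suggest the layer is already better "specialized" for the
    task: its hidden states spread little within a class relative to how far
    apart the class means are. (Illustrative formulation, not the paper's
    exact definition.)
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    global_mean = hidden_states.mean(axis=0)

    within, between = 0.0, 0.0
    for c in classes:
        members = hidden_states[labels == c]
        class_mean = members.mean(axis=0)
        # Within-class variability: mean squared distance to the class mean.
        within += ((members - class_mean) ** 2).sum(axis=1).mean()
        # Between-class variability: squared distance of class mean to global mean.
        between += ((class_mean - global_mean) ** 2).sum()

    # Average over classes; small epsilon avoids division by zero.
    return (within / len(classes)) / (between / len(classes) + 1e-12)
```

In use, one would run the frozen pretrained model once over the task corpus, compute this ratio for every layer's hidden states, and then choose which layers to adapt and where to place the classifier based on the resulting per-layer scores; no training or hyperparameter tuning is involved in the scoring step itself.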