Pre-trained language models achieve strong performance, but they are computationally expensive due to their large size. Techniques such as pruning and knowledge distillation (KD) have been developed to reduce their size and latency. In most structured pruning methods, the pruning units, such as attention heads and feed-forward hidden dimensions, span only a small space of model structures and limit the structures that the pruning algorithm can explore. In this work, we propose Gradient-based Intra-attention pruning (GRAIN), which inspects fine-grained intra-attention structures and allows different heads to have different sizes. Intra-attention pruning greatly expands the search space of model structures and yields highly heterogeneous structures. We further propose structure regularization to encourage generating more regular structures, which achieve higher speedups than heterogeneous ones. We also integrate KD into the pruning process with a gradient separation strategy to reduce the interference of KD with pruning. GRAIN is evaluated on a variety of tasks. Results show that it notably outperforms other methods at the same or similar model sizes. Even under extreme compression, where only $3\%$ of the transformer weights remain, the pruned model is still competitive.
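To make the idea of intra-attention pruning concrete, the sketch below illustrates how per-head masks over query/key dimensions can let different attention heads retain different numbers of dimensions. This is a minimal illustration under assumed shapes and hypothetical keep counts, not the authors' implementation or importance criterion.

```python
# Illustrative sketch of intra-attention pruning via per-head dimension masks.
# Assumptions (not from the paper): 12 heads of size 64, hidden size 768,
# and a hypothetical list of per-head keep counts produced by some
# gradient-based importance score.
import torch

num_heads, head_dim, hidden = 12, 64, 768

# Hypothetical number of query/key dimensions kept per head;
# head 5 keeps 0 dimensions, i.e. it is pruned entirely.
keep = [64, 48, 32, 64, 16, 0, 40, 64, 8, 24, 56, 64]

# Binary mask of shape (num_heads, head_dim): 1 = keep, 0 = prune.
mask = torch.zeros(num_heads, head_dim)
for h, k in enumerate(keep):
    mask[h, :k] = 1.0

# Per-head query/key projection weights, shape (num_heads, head_dim, hidden).
W_q = torch.randn(num_heads, head_dim, hidden)
W_k = torch.randn(num_heads, head_dim, hidden)

# Zero out pruned dimensions; these rows can later be physically removed,
# yielding heads of different (heterogeneous) sizes.
W_q_pruned = W_q * mask.unsqueeze(-1)
W_k_pruned = W_k * mask.unsqueeze(-1)
```

In this view, head-level pruning is the special case where each head's mask is all ones or all zeros; allowing partial masks is what expands the structure search space described above.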