The growing size of neural language models has led to increased attention to model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller compact model to match a larger one. Pruning methods can significantly reduce model size but rarely achieve speedups as large as distillation. Distillation methods, however, require large amounts of unlabeled data and are expensive to train. In this work, we propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency, without resorting to any unlabeled data. Our key insight is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads and hidden units) modules, controlling the pruning decision of each parameter with masks of different granularity. We also devise a layerwise distillation strategy to transfer knowledge from unpruned to pruned models during optimization. Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups and only a small accuracy drop, demonstrating its effectiveness and efficiency compared to previous pruning and distillation approaches.
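To make the idea of masks at different granularities concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of how a coarse-grained mask over a whole feed-forward sublayer could combine with fine-grained masks over its hidden units; all dimensions, weights, and mask names here (z_ffn, z_int, W1, W2) are hypothetical illustrations.

```python
import torch
import torch.nn.functional as F

hidden, intermediate = 8, 32            # toy dimensions, chosen for illustration
W1 = torch.randn(intermediate, hidden)  # up-projection weights of the FFN
b1 = torch.zeros(intermediate)
W2 = torch.randn(hidden, intermediate)  # down-projection weights of the FFN
b2 = torch.zeros(hidden)

# Coarse-grained mask: a single scalar that can drop the entire FFN sublayer.
z_ffn = torch.tensor(1.0)
# Fine-grained mask: one scalar per intermediate hidden unit.
z_int = torch.ones(intermediate)

def ffn(x):
    # An intermediate unit survives only if both its fine-grained mask and the
    # coarse-grained sublayer mask are non-zero; zeroing an activation here is
    # equivalent to pruning the corresponding row of W1 and column of W2.
    h = F.relu(F.linear(x, W1, b1)) * z_int
    return z_ffn * F.linear(h, W2, b2)

x = torch.randn(2, hidden)
print(ffn(x).shape)  # torch.Size([2, 8])
```

In practice such masks would be learned (e.g., relaxed to continuous values during training and thresholded afterward), with analogous coarse masks for attention heads and whole layers; the sketch above only illustrates how decisions at different granularities can jointly gate the same parameters.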