Pruning aims to reduce the number of parameters while maintaining performance close to that of the original network. This work proposes a novel \emph{self-distillation}-based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed {\em cross-correlation objective for self-distilled pruning} implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against distilled networks that are 6 times larger. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) leads to faster convergence after pruning steps, providing further insights into why self-distilled pruning improves generalization.
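To make the cross-correlation objective concrete, the following PyTorch sketch illustrates one plausible formulation: a Barlow Twins-style loss on the cross-correlation matrix between representations of the same batch produced by the pruned and unpruned networks, where the on-diagonal term maximizes representational similarity and the off-diagonal term decorrelates redundant features (which is consistent with the sparsity-encouraging behavior described above). This is a minimal illustration under those assumptions, not the paper's exact loss; the names \texttt{cross\_correlation\_loss}, \texttt{z\_pruned}, \texttt{z\_unpruned}, and the weighting \texttt{lam} are illustrative.

\begin{verbatim}
import torch

def cross_correlation_loss(z_pruned: torch.Tensor,
                           z_unpruned: torch.Tensor,
                           lam: float = 5e-3) -> torch.Tensor:
    """Cross-correlation objective between pruned and unpruned representations.

    z_pruned, z_unpruned: (batch, dim) hidden representations of the same
    inputs from the pruned network and the original (unpruned) network.
    Assumed Barlow Twins-style form, for illustration only.
    """
    n, _ = z_pruned.shape
    # Standardize each feature over the batch.
    zp = (z_pruned - z_pruned.mean(0)) / (z_pruned.std(0) + 1e-8)
    zu = (z_unpruned - z_unpruned.mean(0)) / (z_unpruned.std(0) + 1e-8)
    # dim x dim cross-correlation matrix between the two networks' features.
    c = (zp.T @ zu) / n
    # Pull corresponding features together (diagonal -> 1) and
    # decorrelate the rest (off-diagonal -> 0).
    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()
    return on_diag + lam * off_diag

# Hypothetical usage alongside the task loss during pruning-aware fine-tuning:
#   loss = task_loss + alpha * cross_correlation_loss(h_pruned, h_unpruned.detach())
\end{verbatim}

Detaching the unpruned network's representations treats it as a fixed teacher of itself, so the gradient only updates the surviving weights of the pruned model.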