Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods have proven effective for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is 2.4x faster and 74% smaller than BERT on SQuAD v1, with a 1% drop in F1, competitive both with distilled models in speed and with pruned models in size.
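To make the block pruning idea concrete, the sketch below shows one way a block-level pruning mask could be attached to a linear layer in PyTorch. The module name `BlockPrunedLinear`, the threshold-based hard mask, and the straight-through estimator are illustrative assumptions rather than the paper's implementation; what the sketch shares with the described approach is that one importance score is learned per weight block of an arbitrary size during fine-tuning, in the spirit of movement pruning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockPrunedLinear(nn.Module):
    """Linear layer pruned in fixed-size blocks (hypothetical sketch).

    Each (block_rows x block_cols) block of the weight matrix gets one
    learned importance score; blocks whose score falls below `threshold`
    are masked out in the forward pass, and gradients reach the scores
    through a straight-through estimator."""

    def __init__(self, in_features, out_features,
                 block_rows=32, block_cols=32, threshold=0.0):
        super().__init__()
        assert out_features % block_rows == 0 and in_features % block_cols == 0
        self.linear = nn.Linear(in_features, out_features)
        self.block_rows, self.block_cols = block_rows, block_cols
        self.threshold = threshold
        # One importance score per block; start slightly above the threshold
        # so training begins from the dense model.
        self.scores = nn.Parameter(
            0.01 * torch.ones(out_features // block_rows,
                              in_features // block_cols))

    def block_mask(self):
        # Hard 0/1 mask in the forward pass; the `+ scores - scores.detach()`
        # term gives the scores an identity gradient (straight-through).
        hard = (self.scores > self.threshold).float()
        return hard + self.scores - self.scores.detach()

    def forward(self, x):
        # Expand the per-block mask to the full weight shape and apply it.
        mask = self.block_mask()
        mask = mask.repeat_interleave(self.block_rows, dim=0)
        mask = mask.repeat_interleave(self.block_cols, dim=1)
        return F.linear(x, self.linear.weight * mask, self.linear.bias)


# With block_rows equal to one attention head's dimension and block_cols
# equal to the hidden size, zeroed blocks correspond to whole heads, which
# can then be removed from the model after fine-tuning.
layer = BlockPrunedLinear(768, 768, block_rows=64, block_cols=768)
out = layer(torch.randn(2, 128, 768))
print(out.shape)  # torch.Size([2, 128, 768])
```

Choosing the block shape this way is what lets the method recover structured speedups: blocks aligned with attention heads or feed-forward slices can be physically dropped from the network, which is how the pruned model becomes both smaller and faster at inference.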