Methods for improving the efficiency of deep network training (i.e., the resources required to achieve a given level of model quality) are of immediate benefit to deep learning practitioners. Distillation is typically used to compress models or improve model quality, but it is unclear whether distillation actually improves training efficiency. Can the quality improvements of distillation be converted into training speed-ups, or do they simply increase final model quality with no resource savings? We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training, using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective and evaluated on GLUE, on common enterprise hardware (8x NVIDIA A100). We found that distillation can speed up training by up to 1.96x for ResNet-50 trained on ImageNet and by up to 1.42x for BERT when evaluated on GLUE. Furthermore, distillation for BERT yields optimal results when it is performed only for the first 20-50% of training. We also observed that training with distillation is almost always more efficient than training without distillation, even when using the poorest-quality model as a teacher, for both ResNet-50 and BERT. Finally, we found that it is possible to gain the benefit of distilling from an ensemble of teacher models, which has an O(n) runtime cost, by randomly sampling a single teacher from the pool of teacher models on each step, which has only an O(1) runtime cost. Taken together, these results show that distillation can substantially improve training efficiency in both image classification and language modeling, and that a few simple optimizations to distillation protocols can further enhance these efficiency improvements.
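To make the single-teacher-sampling idea concrete, the following is a minimal PyTorch-style sketch of one distillation training step that draws one teacher uniformly at random from the pool instead of querying all of them. The function name, the loss weighting `alpha`, the `temperature` value, and the uniform sampling distribution are illustrative assumptions, not details taken from the paper.

```python
import random
import torch
import torch.nn.functional as F

def distillation_step(student, teachers, batch, optimizer,
                      alpha=0.5, temperature=2.0):
    """One training step with single-teacher sampling.

    Rather than averaging soft targets from all n teachers (O(n) forward
    passes per step), draw one teacher uniformly at random each step
    (O(1) forward passes), as described in the abstract.
    """
    inputs, labels = batch

    # O(1) cost: only the sampled teacher runs a forward pass this step.
    teacher = random.choice(teachers)
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    # Standard hard-label loss plus temperature-scaled soft-label loss.
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In expectation over training steps, sampling a teacher per step exposes the student to the whole pool, which is why this O(1) variant can approximate the benefit of the O(n) ensemble-averaging protocol.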