Recently, large pre-trained models have significantly improved the performance of various Natural Language Processing (NLP) tasks, but they are expensive to serve due to long serving latency and large memory usage. To compress these models, knowledge distillation has attracted increasing interest as one of the most effective methods for model compression. However, existing distillation methods have not yet addressed the unique challenges of model serving in datacenters, such as handling fast-evolving models, considering serving performance, and optimizing for multiple objectives. To solve these problems, we propose AutoDistill, an end-to-end model distillation framework that integrates model architecture exploration and multi-objective optimization for building hardware-efficient NLP pre-trained models. We use Bayesian Optimization to conduct multi-objective Neural Architecture Search for selecting student model architectures. The proposed search comprehensively considers both prediction accuracy and serving latency on the target hardware. Experiments on TPUv4i identify seven model architectures with better pre-trained accuracy (up to 3.2% higher) and lower inference latency (up to 1.44x faster) than MobileBERT. On downstream NLP tasks in the GLUE benchmark, the model distilled for pre-training by AutoDistill with 28.5M parameters achieves an average score of 81.69, higher than BERT_BASE, DistilBERT, TinyBERT, NAS-BERT, and MobileBERT. The most compact model found by AutoDistill contains only 20.6M parameters but still outperforms BERT_BASE (109M), DistilBERT (67M), TinyBERT (67M), and MobileBERT (25.3M) on the average GLUE score. On SQuAD, a model found by AutoDistill achieves an 88.4% F1 score with 22.8M parameters, reducing parameters by more than 62% while maintaining higher accuracy than DistilBERT, TinyBERT, and NAS-BERT.
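To make the multi-objective selection concrete, the sketch below shows one simple way candidate student architectures could be compared on the two objectives named in the abstract, prediction accuracy and serving latency, by keeping only the Pareto-optimal candidates. This is an illustrative assumption, not the paper's implementation: AutoDistill drives the search with Bayesian Optimization, whereas this sketch only covers the comparison step, and the `Candidate` class, function names, and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A hypothetical student architecture with its two measured objectives."""
    name: str
    accuracy: float    # higher is better (e.g., pre-training accuracy)
    latency_ms: float  # lower is better (measured on the target hardware)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both objectives and strictly better on one."""
    return (a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
            and (a.accuracy > b.accuracy or a.latency_ms < b.latency_ms))


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    evaluated = [
        Candidate("student_a", accuracy=0.82, latency_ms=4.1),
        Candidate("student_b", accuracy=0.80, latency_ms=3.2),
        Candidate("student_c", accuracy=0.79, latency_ms=5.0),  # dominated by student_a
    ]
    for c in pareto_front(evaluated):
        print(f"{c.name}: accuracy={c.accuracy:.2f}, latency={c.latency_ms:.1f} ms")
```

In a full search loop, an optimizer would propose new architectures, each would be pre-trained (or proxied) and benchmarked on the target hardware, and a selection step like the one above would decide which candidates advance to full distillation.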