With the growing demand for democratizing deep learning, there is increasing interest in deploying Transformer-based natural language processing (NLP) models on resource-constrained devices with low latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with an efficient evaluation scheme that selects a "good" (i.e., high-accuracy) sub-network candidate under a given overall pruning ratio constraint. The proposed method requires no human expert experience and achieves better accuracy on many NLP tasks. Experimental results on the General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms state-of-the-art (SOTA) hand-crafted pruning methods on BERT$_{\mathrm{BASE}}$. On QNLI and RTE, we obtain 75\% and 42.8\% higher overall pruning ratios, respectively, while achieving higher accuracy. On MRPC, we obtain a 4.6-point higher score than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we achieve a 40\% higher pruning ratio with only a negligible loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that, after model compression, the inference of a single BERT$_{\mathrm{BASE}}$ encoder on a Xilinx Alveo U200 FPGA board achieves a 1.83$\times$ speedup over an Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPU, demonstrating the feasibility of deploying the BERT$_{\mathrm{BASE}}$ sub-networks generated by the proposed method on computation-restricted devices.