Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. A knowledge distillation approach addresses this computational efficiency problem by self-distilling BERT into a smaller transformer representation with fewer layers and a smaller internal embedding. However, the performance of these models drops as the number of layers is reduced, notably on advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially implementing the Length-Adaptive Transformer (LAT) technique on top of TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss. In this work, we expand the Dynamic-TinyBERT approach to generate a much more efficient model. We use MiniLM distillation jointly with the LAT method, and we further enhance efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches for any computational budget on the SQuAD1.1 dataset (up to an 8.8x speedup with <1% accuracy loss). The code to reproduce this work is publicly available on GitHub.
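To illustrate the low-bit quantization component, the snippet below is a minimal sketch of post-training dynamic int8 quantization applied to a MiniLM-style checkpoint with PyTorch and Hugging Face Transformers. The model name, the use of dynamic quantization, and the example inputs are assumptions for illustration only; they are not the QuaLA-MiniLM training or quantization pipeline described above.

```python
# Minimal sketch (assumption: a MiniLM-style checkpoint from the Hugging Face hub;
# this is NOT the paper's QuaLA-MiniLM pipeline, only an int8 quantization demo).
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "microsoft/MiniLM-L12-H384-uncased"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: the QA head of this base checkpoint is randomly initialized, so the
# extracted span is meaningless; the point is the quantized forward pass.
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Convert Linear layers to int8 weights; activations remain fp32 and are
# quantized dynamically at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Span-QA style forward pass on a toy question/context pair.
inputs = tokenizer(
    "Who created SQuAD?",
    "SQuAD is a reading comprehension dataset created by Stanford researchers.",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = quantized(**inputs)
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```

Dynamic quantization shrinks the Linear-layer weights to int8, which is one way to reduce latency and memory on CPU; the length-adaptive (LAT) token-dropping and the distillation steps summarized above are separate components not shown here.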