Limited computational budgets often prevent transformers from being used in production and from benefiting from their high accuracy. Knowledge distillation addresses the computational cost by self-distilling BERT into a smaller transformer with fewer layers and a smaller internal embedding size. However, the performance of these models drops as the number of layers is reduced, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially applying the Length Adaptive Transformer (LAT) technique to TinyBERT, achieving an x3 speedup over BERT-base with minimal accuracy loss. In this work, we extend the Dynamic-TinyBERT approach to generate a much more efficient model. We use MiniLM distillation jointly with the LAT method, and we further enhance efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches at any computational budget on the SQuAD1.1 dataset (up to an x8.8 speedup with <1% accuracy loss). The code to reproduce this work will be publicly released on GitHub soon.
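To illustrate the low-bit quantization step mentioned above, the following is a minimal sketch (not the authors' released pipeline) that applies post-training dynamic INT8 quantization to a MiniLM-style question-answering model with PyTorch and Hugging Face Transformers. The checkpoint name is a placeholder assumption; in practice a SQuAD1.1-finetuned, LAT-trained MiniLM would be used.

```python
# Minimal sketch: dynamic INT8 quantization of a MiniLM-style QA model.
# Assumptions: checkpoint name is a placeholder, not the paper's released model.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "microsoft/MiniLM-L12-H384-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

# Quantize all Linear layers to INT8 weights; activations are quantized
# dynamically at runtime, reducing model size and speeding up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Run span question answering with the quantized model.
question = "What does QuaLA-MiniLM combine?"
context = ("QuaLA-MiniLM combines MiniLM distillation, the Length Adaptive "
           "Transformer technique, and low-bit quantization.")
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```

Dynamic quantization alone does not implement the length-adaptive part of the method; in the paper, token-length configurations are additionally searched and applied at inference time to meet a given computational budget.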