The Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires an enormous amount of computation to achieve high performance, which makes it unsuitable for mobile applications that are tightly constrained by hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in local context modeling (by convolution) while another group specializes in long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms the transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of the transformer base model by 2.5x with 0.3 BLEU score degradation. Combined with pruning and quantization, we further compressed the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 BLEU in the mobile NLP setting, without the costly architecture search that requires more than 250 GPU years. Code has been made available at https://github.com/mit-han-lab/lite-transformer.
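To make the LSRA idea concrete, below is a minimal PyTorch sketch of a block that splits the channels into two branches: half the features pass through standard multi-head self-attention (long-range branch) and the other half through a depthwise 1-D convolution (local branch), after which the two outputs are concatenated and projected. The branch split, kernel size, dimensions, and the use of a plain depthwise convolution (rather than the lightweight convolutions in the released code) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class LSRABlock(nn.Module):
    """Minimal sketch of Long-Short Range Attention (illustrative, not the official module)."""

    def __init__(self, embed_dim=512, num_heads=4, kernel_size=3):
        super().__init__()
        assert embed_dim % 2 == 0
        half = embed_dim // 2
        # Long-range branch: multi-head self-attention over half the channels.
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Short-range branch: depthwise 1-D convolution models the local context.
        # (Assumption: the paper uses lightweight convolutions here instead.)
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        left, right = x.chunk(2, dim=-1)
        global_out, _ = self.attn(left, left, left)                    # long-distance relations
        local_out = self.conv(right.transpose(1, 2)).transpose(1, 2)   # local neighborhood
        return self.out_proj(torch.cat([global_out, local_out], dim=-1))


if __name__ == "__main__":
    block = LSRABlock()
    tokens = torch.randn(2, 10, 512)   # (batch, seq_len, embed_dim)
    print(block(tokens).shape)         # torch.Size([2, 10, 512])
```

Splitting the channels, rather than running both branches on the full width, is what keeps the block's MAC count comparable to a single attention layer at mobile-scale budgets.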