Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency, so they cannot be deployed on resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE) and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering tasks, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).
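To make the bottleneck idea above concrete, below is a minimal sketch (not the authors' released code) of a narrow transformer block with linear bottleneck projections around self-attention and a stack of small feed-forward networks, written in PyTorch. The dimensions used here (512 inter-block hidden size, 128 intra-block bottleneck size, 4 attention heads, 4 stacked FFNs) follow the configuration reported in the MobileBERT paper; the class and variable names are illustrative assumptions only.

```python
# Hypothetical sketch of a MobileBERT-style bottleneck transformer block.
# Not the authors' implementation; dimensions are assumptions taken from
# the paper's reported configuration.
import torch
import torch.nn as nn


class BottleneckTransformerBlock(nn.Module):
    def __init__(self, hidden=512, bottleneck=128, heads=4, ffn_stack=4):
        super().__init__()
        # Linear "bottleneck" projections shrink the block input before
        # self-attention and expand it back afterwards.
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(bottleneck)
        # Several small stacked FFNs re-balance the parameter ratio between
        # self-attention and feed-forward computation in the narrow block.
        self.ffns = nn.ModuleList(
            nn.Sequential(
                nn.Linear(bottleneck, bottleneck * 4),
                nn.ReLU(),
                nn.Linear(bottleneck * 4, bottleneck),
            )
            for _ in range(ffn_stack)
        )
        self.ffn_norms = nn.ModuleList(
            nn.LayerNorm(bottleneck) for _ in range(ffn_stack)
        )

    def forward(self, x):                      # x: (batch, seq_len, hidden)
        h = self.down(x)                       # enter the bottleneck
        a, _ = self.attn(h, h, h)
        h = self.attn_norm(h + a)              # residual + LayerNorm
        for ffn, norm in zip(self.ffns, self.ffn_norms):
            h = norm(h + ffn(h))               # stacked FFNs with residuals
        return x + self.up(h)                  # leave the bottleneck


if __name__ == "__main__":
    block = BottleneckTransformerBlock()
    tokens = torch.randn(2, 16, 512)           # (batch, seq_len, hidden)
    print(block(tokens).shape)                 # torch.Size([2, 16, 512])
```

In this sketch the wide 512-dimensional representation flows between blocks, while attention and FFNs operate on the cheap 128-dimensional bottleneck, which is the source of the size and latency savings the abstract describes; the teacher used for knowledge transfer instead places an inverted bottleneck (narrow outside, wide inside) on top of BERT_LARGE so that its per-block inputs and outputs match the student's.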