Large-scale language model pretraining is a very successful form of self-supervised learning in natural language processing, but it has become increasingly expensive as models and pretraining corpora have grown larger over time. We propose NarrowBERT, a modified transformer encoder that increases the throughput of masked language model pretraining by more than $2\times$. NarrowBERT sparsifies the transformer model so that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining, rather than on all of the tokens as in the standard transformer encoder. We also show that NarrowBERT increases throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks such as MNLI. Finally, we examine the performance of NarrowBERT on the IMDB and Amazon reviews classification tasks and the CoNLL NER task and show that it remains comparable to standard BERT.
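To make the narrowing idea concrete, the following is a minimal PyTorch sketch of a single encoder layer in which the attention queries and the feedforward sublayer run only on the masked positions, while keys and values still cover the full sequence. All class names, dimensions, and the gathering scheme are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NarrowEncoderLayer(nn.Module):
    """Sketch of a 'narrowed' transformer encoder layer (hypothetical names)."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, hidden, masked_idx):
        # hidden: (batch, seq_len, d_model); masked_idx: (batch, n_masked) long indices
        # Gather only the masked positions; these alone serve as attention queries.
        narrow = torch.gather(
            hidden, 1, masked_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        )
        # Queries come from the masked tokens; keys/values come from the full sequence.
        attn_out, _ = self.attn(narrow, hidden, hidden)
        narrow = self.norm1(narrow + attn_out)
        # The feedforward sublayer also runs only on the (much shorter) narrow slice.
        narrow = self.norm2(narrow + self.ff(narrow))
        return narrow  # (batch, n_masked, d_model), fed to the MLM prediction head

# Usage: with ~15% of a 128-token sequence masked, the narrowed sublayers process
# roughly 19 positions instead of 128, which is where the pretraining speedup comes from.
layer = NarrowEncoderLayer()
hidden = torch.randn(2, 128, 768)
masked_idx = torch.randint(0, 128, (2, 19))
out = layer(hidden, masked_idx)
print(out.shape)  # torch.Size([2, 19, 768])
```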