The highly popular Transformer architecture, based on self-attention, is the foundation of large pretrained models such as BERT, which have become an enduring paradigm in NLP. While powerful, these models require computational resources and pretraining time that can be prohibitive. In this work, we present an alternative self-attention architecture, Shatter, that more efficiently encodes sequence information by softly partitioning the space of relative positions and applying different value matrices to different parts of the sequence. This mechanism further allows us to simplify the multi-headed attention in Transformer to single-headed. We conduct extensive experiments showing that Shatter achieves better performance than BERT, with pretraining that is 15% faster per step on TPU, converges in fewer steps, and offers considerable memory savings (>50%). Put together, Shatter can be pretrained on 8 V100 GPUs in 7 days and match the performance of BERT_Base -- making the cost of pretraining much more affordable.
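For intuition, the sketch below shows one way the described mechanism could look in PyTorch: a single attention head whose relative positions are softly assigned to a small number of partitions, each with its own value matrix. This is only an illustrative sketch under our own assumptions (the class name SoftPartitionAttention, the parameter num_partitions, and the clipped relative-position range are hypothetical), not the authors' Shatter implementation.

```python
# Illustrative sketch (not the released Shatter code): single-headed
# self-attention where each relative position is softly assigned to one of
# a few partitions, and each partition applies its own value matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPartitionAttention(nn.Module):
    def __init__(self, hidden, num_partitions=4, max_rel_pos=128):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        # One value matrix per partition instead of one per attention head.
        self.v = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_partitions))
        # Learned logits mapping each (clipped) relative position to soft partition weights.
        self.partition_logits = nn.Parameter(torch.zeros(2 * max_rel_pos + 1, num_partitions))
        self.max_rel_pos = max_rel_pos
        self.scale = hidden ** -0.5

    def forward(self, x):                                   # x: (batch, seq, hidden)
        b, n, h = x.shape
        q, k = self.q(x), self.k(x)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)  # (b, n, n)

        # Relative positions j - i, clipped to [-max_rel_pos, max_rel_pos] and shifted to indices.
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_pos, self.max_rel_pos) + self.max_rel_pos
        part = F.softmax(self.partition_logits[rel], dim=-1)  # (n, n, P): soft partition weights

        # Each partition contributes its attention mass applied to its own value projection.
        out = 0
        for p, v_p in enumerate(self.v):
            out = out + (attn * part[:, :, p]) @ v_p(x)      # (b, n, hidden)
        return out
```

With a hard (one-hot) partition assignment this reduces to applying a single value matrix per region of relative positions; the soft assignment lets the model learn the partition boundaries, which is what makes a single head sufficient in place of multiple heads that each see the whole sequence.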