We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.
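To make the gated attention unit mentioned above concrete, the sketch below shows one plausible forward pass in which elementwise gating is combined with a single weak attention head. The function names, shapes (`T, d, e, s`), activation choices, and weight initializations are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a gated attention unit (GAU) forward pass.
# The SiLU gates and squared-ReLU attention weights are assumptions for
# illustration, not necessarily the paper's exact design.
import numpy as np

def silu(x):
    """SiLU activation used for the gating branches (assumed)."""
    return x / (1.0 + np.exp(-x))

def gated_attention_unit(x, w_u, w_v, w_z, w_o):
    """Single-head gated attention over a sequence x of shape (T, d).

    w_u, w_v: (d, e) projections for the gate and value branches.
    w_z:      (d, s) projection to a small shared query/key space.
    w_o:      (e, d) output projection back to the model dimension.
    """
    T = x.shape[0]
    u = silu(x @ w_u)                        # gate branch, (T, e)
    v = silu(x @ w_v)                        # value branch, (T, e)
    z = silu(x @ w_z)                        # shared query/key representation, (T, s)
    scores = z @ z.T / z.shape[1]            # single-head attention logits, (T, T)
    attn = np.maximum(scores, 0.0) ** 2 / T  # squared-ReLU attention weights (assumed)
    out = u * (attn @ v)                     # gating compensates for the weak attention
    return out @ w_o                         # back to model dimension, (T, d)

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
T, d, e, s = 8, 16, 32, 8
x = rng.normal(size=(T, d))
y = gated_attention_unit(
    x,
    rng.normal(size=(d, e)) * 0.1,
    rng.normal(size=(d, e)) * 0.1,
    rng.normal(size=(d, s)) * 0.1,
    rng.normal(size=(e, d)) * 0.1,
)
print(y.shape)  # (8, 16)
```

The key design point this sketch tries to convey is that the multiplicative gate `u` relaxes the burden on attention, which is what allows a single, simpler attention head to replace multi-head attention with minimal quality loss.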