The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate the inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks of fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
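To make the two ideas in the abstract concrete, below is a minimal sketch (not the authors' implementation) of (i) a damped exponential moving average that injects position-aware local bias into a sequence and (ii) single-head attention restricted to fixed-length chunks, which is what makes the cost linear in sequence length. The function names `ema_smooth` and `chunked_attention` and the parameters `alpha`, `delta`, and `chunk_size` are illustrative assumptions; the actual Mega layer additionally uses learned gates, projections, and a multi-dimensional EMA expansion that are omitted here.

```python
import numpy as np

def ema_smooth(x, alpha, delta):
    """Damped exponential moving average along the time axis (sketch).

    x:     (seq_len, d) input sequence
    alpha: (d,) per-dimension smoothing factors in (0, 1)
    delta: (d,) per-dimension damping factors in (0, 1)
    Each output position mixes the current input with a decayed summary of
    all earlier positions, giving a position-aware local inductive bias.
    """
    y = np.zeros_like(x)
    h = np.zeros(x.shape[1])
    for t in range(x.shape[0]):
        h = alpha * x[t] + (1.0 - alpha * delta) * h
        y[t] = h
    return y

def chunked_attention(q, k, v, chunk_size):
    """Single-head softmax attention applied within fixed-length chunks (sketch).

    Each chunk attends only to itself, so the cost grows linearly with
    sequence length instead of quadratically, at the price of dropping
    cross-chunk interactions inside the attention step.
    """
    seq_len, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        scores = q[start:end] @ k[start:end].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ v[start:end]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d = 64, 16
    x = rng.standard_normal((seq_len, d))
    alpha = rng.uniform(0.1, 0.9, size=d)
    delta = rng.uniform(0.1, 0.9, size=d)
    smoothed = ema_smooth(x, alpha, delta)  # EMA-injected local bias
    out = chunked_attention(smoothed, smoothed, x, chunk_size=16)
    print(out.shape)  # (64, 16)
```

Even in this toy form, the division of labor is visible: the EMA carries local, position-dependent information across chunk boundaries, so restricting attention to within-chunk interactions loses relatively little.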