The quadratic computational and memory complexity of the Transformer's attention mechanism has limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. Compared with a more traditional attention mechanism, Luna introduces an additional fixed-length sequence as input and an additional corresponding output, which allows Luna to perform the attention operation in linear time while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of models.
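The pack-and-unpack structure described above can be illustrated with a short sketch. The following is a minimal PyTorch example, not the paper's implementation: names such as `LunaAttention` and `pack_len`, the use of a learned extra sequence, and plain scaled dot-product attention for both stages are illustrative assumptions. It shows why the cost is linear in the sequence length n when the packed length l is a fixed constant.

```python
# A minimal sketch of the nested attention idea, assuming standard
# scaled dot-product attention for both the pack and unpack steps.
import torch
import torch.nn as nn
import torch.nn.functional as F


def attention(q, k, v):
    # Scaled dot-product attention: (B, Lq, D), (B, Lk, D) -> (B, Lq, D)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v


class LunaAttention(nn.Module):
    def __init__(self, d_model: int, pack_len: int = 16):
        super().__init__()
        # Extra fixed-length input sequence (learned here for simplicity).
        self.p = nn.Parameter(torch.randn(pack_len, d_model))

    def forward(self, x):
        # x: (batch, n, d_model); n may be very long.
        b = x.size(0)
        p = self.p.unsqueeze(0).expand(b, -1, -1)    # (b, l, d)
        packed = attention(p, x, x)                  # pack step:   O(n * l)
        unpacked = attention(x, packed, packed)      # unpack step: O(n * l)
        # The packed sequence is the additional fixed-length output
        # mentioned in the abstract; the unpacked sequence has length n.
        return unpacked, packed


# Usage: cost grows linearly with n because pack_len is a fixed constant.
luna = LunaAttention(d_model=64, pack_len=16)
y, p_out = luna(torch.randn(2, 1024, 64))
print(y.shape, p_out.shape)  # torch.Size([2, 1024, 64]) torch.Size([2, 16, 64])
```

Because each of the two attention calls attends between a length-n sequence and a length-l sequence, the total time and memory are O(n·l) rather than O(n²).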