Transformers achieve remarkable performance in various domains, including NLP, CV, audio processing, and graph analysis. However, they do not scale well to long-sequence tasks due to their quadratic complexity with respect to input length. Linear Transformers were proposed to address this limitation. However, these models have shown weaker performance on long-sequence tasks compared to the original Transformer. In this paper, we examine Linear Transformer models, rethinking their two core components. First, we improve the Linear Transformer with a Shift-Invariant Kernel Function (SIKF), which achieves higher accuracy without loss of speed. Second, we introduce FastRPB (Fast Relative Positional Bias), which efficiently adds positional information to self-attention using the Fast Fourier Transform. FastRPB is independent of the self-attention mechanism and can be combined with the original self-attention as well as all its efficient variants. FastRPB has O(N log N) computational complexity and requires O(N) memory with respect to input sequence length N.
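To make the complexity claim concrete, the sketch below (not the paper's implementation) illustrates the standard trick such FFT-based positional-bias schemes rely on: a relative positional bias matrix B with B[i, j] depending only on the offset i - j is Toeplitz, so B @ V can be computed in O(N log N) time and O(N) extra memory by embedding B in a circulant matrix and using the FFT, instead of materializing the N x N matrix. All names here (toeplitz_bias_matmul, b_pos, b_neg) are hypothetical and for illustration only.

import numpy as np

def toeplitz_bias_matmul(b_pos, b_neg, values):
    """Compute B @ values where B[i, j] = b_pos[i - j] if i >= j else b_neg[j - i].

    b_pos:  (N,) bias for non-negative offsets 0..N-1
    b_neg:  (N,) bias for negative offsets (b_neg[0] unused)
    values: (N, d) value vectors
    Runs in O(N log N) time with O(N) extra memory per feature dimension.
    """
    N, d = values.shape
    # Embed the Toeplitz matrix into a 2N x 2N circulant matrix:
    # first column = [b_pos[0..N-1], 0, b_neg[N-1], ..., b_neg[1]]
    col = np.concatenate([b_pos, np.zeros(1), b_neg[1:][::-1]])
    # Circulant matvec = circular convolution, computed via FFT.
    col_f = np.fft.rfft(col)
    v_pad = np.concatenate([values, np.zeros((N, d))], axis=0)
    out = np.fft.irfft(col_f[:, None] * np.fft.rfft(v_pad, axis=0), n=2 * N, axis=0)
    return out[:N]

# Sanity check against the naive O(N^2) dense bias matrix.
N, d = 8, 4
rng = np.random.default_rng(0)
b_pos, b_neg = rng.normal(size=N), rng.normal(size=N)
V = rng.normal(size=(N, d))
B = np.array([[b_pos[i - j] if i >= j else b_neg[j - i] for j in range(N)]
              for i in range(N)])
assert np.allclose(B @ V, toeplitz_bias_matmul(b_pos, b_neg, V))

Because the FFT-based product never forms the N x N bias matrix, this step stays linear in memory, which is what allows a positional bias of this kind to be attached to linear self-attention without reintroducing quadratic cost.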