Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

The attention module, a crucial component of the Transformer, cannot scale efficiently to long sequences because of its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of kernelized attention. Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.
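The Toeplitz observation in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It only shows the generic building block: multiplying an $n \times n$ Toeplitz matrix by a matrix in $\mathcal{O}(n\log n)$ time by embedding it in a circulant matrix of size $2n-1$ and using the FFT. The function names `toeplitz_from_offsets` and `toeplitz_matmul_fft`, and the layout of the vector `b` (entry `b[i - j + n - 1]` holds the value for relative offset `i - j`), are assumptions made for this example.

```python
import numpy as np

def toeplitz_from_offsets(b, n):
    """Dense n x n Toeplitz matrix with T[i, j] = b[i - j + n - 1].

    Hypothetical helper, used only to check the fast version; costs O(n^2).
    """
    idx = np.arange(n)[:, None] - np.arange(n)[None, :] + n - 1
    return b[idx]

def toeplitz_matmul_fft(b, x):
    """Compute T @ x in O(n log n) per column, without building T.

    b: shape (2n - 1,), values for relative offsets -(n-1), ..., n-1.
    x: shape (n, d).
    """
    n = x.shape[0]
    m = 2 * n - 1  # size of the circulant embedding
    # First column of the circulant matrix: offsets 0..n-1, then -(n-1)..-1.
    col = np.concatenate([b[n - 1:], b[:n - 1]])
    col_f = np.fft.rfft(col)
    x_f = np.fft.rfft(x, n=m, axis=0)  # zero-pads x to length m along axis 0
    # Circular convolution in the frequency domain; keep the first n rows.
    y = np.fft.irfft(col_f[:, None] * x_f, n=m, axis=0)
    return y[:n]

rng = np.random.default_rng(0)
n, d = 8, 3
b = rng.standard_normal(2 * n - 1)
x = rng.standard_normal((n, d))

dense = toeplitz_from_offsets(b, n) @ x
fast = toeplitz_matmul_fft(b, x)
```

Here `dense` and `fast` should agree up to floating-point error. In an attention layer with RPE, `b` would presumably play the role of the per-offset positional terms and `x` the key- and value-derived features, but that mapping is an assumption of this sketch and is not spelled out in the abstract.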
