Transformers have achieved remarkable success in sequence modeling and beyond, but they suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques such as sparse attention, linear attention, and hashing tricks, efficient transformers have been proposed to reduce this quadratic complexity, but they often significantly degrade accuracy. In response, we first interpret the linear attention and residual connections used in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emph{momentum transformer}, which uses momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexity. Furthermore, we develop an adaptive strategy for computing the momentum value of our model, based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrates that the momentum transformer outperforms popular linear transformers in both training efficiency and accuracy.
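To make the abstract's idea concrete, the following is a minimal sketch, not the paper's exact formulation: it assumes the standard causal linear-attention recurrence (a running key-value state updated once per token) and treats that update as a gradient step, adding a heavy-ball-style momentum term. The feature map, the coefficients \texttt{beta} and \texttt{gamma}, and the helper \texttt{optimal\_momentum} (the classical optimal heavy-ball momentum for a quadratic objective with Hessian eigenvalues in $[\mu, L]$) are illustrative assumptions; how $\mu$ and $L$ would be estimated adaptively is not specified here.

\begin{verbatim}
import numpy as np

def elu_feature_map(x):
    # Common positive feature map used in linear attention: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def momentum_linear_attention(Q, K, V, beta=0.6, gamma=1.0, eps=1e-6):
    """Causal linear attention with a heavy-ball-style momentum state.

    Q, K: (T, d_k), V: (T, d_v). `beta` (momentum) and `gamma` (step size)
    are illustrative hyperparameters, not the paper's parameterization.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)

    S = np.zeros((d_k, d_v))   # running key-value state of linear attention
    P = np.zeros((d_k, d_v))   # momentum (velocity) accumulated over steps
    z = np.zeros(d_k)          # running normalizer
    out = np.zeros((T, d_v))
    for t in range(T):
        # Plain linear attention would update S += outer(phi_k, v); here that
        # update is viewed as a gradient step and augmented with momentum.
        P = beta * P + np.outer(phi_K[t], V[t])
        S = S + gamma * P
        z = z + phi_K[t]
        out[t] = S.T @ phi_Q[t] / (z @ phi_Q[t] + eps)
    return out

def optimal_momentum(mu, L):
    # Classical optimal heavy-ball momentum for a quadratic objective whose
    # Hessian eigenvalues lie in [mu, L]; an adaptive scheme would estimate
    # these bounds on the fly.
    return ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
\end{verbatim}

The state \texttt{S}, \texttt{P}, and \texttt{z} are fixed-size regardless of sequence length, which is why the momentum variant retains the linear memory and computational complexity of linear attention.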