Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performance on various tasks and corpora. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution, which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil behind the unbounded gradients, and we show theoretically and empirically that it is unnecessary in linear attention. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage diagonal attention to confine attention to only neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, TransNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing the vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer .
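To make the core idea concrete, below is a minimal sketch of a linear attention layer in which the usual row-wise scaling (the denominator of kernel-based linear attention) is dropped and replaced by a normalization applied to the attention output, as the abstract describes. The ReLU feature map, the LayerNorm placement, and the head-splitting details here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormLinearAttention(nn.Module):
    """Sketch: linear attention without the scaling step.

    Instead of dividing by the row sums of the kernelized attention
    matrix (the source of unbounded gradients discussed above), the
    per-head output is passed through a normalization layer.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Normalization replaces the scaling operation (assumed LayerNorm here).
        self.norm = nn.LayerNorm(dim // heads)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Split heads: (batch, heads, seq_len, head_dim).
        q, k, v = (t.reshape(b, n, h, -1).transpose(1, 2) for t in (q, k, v))
        # Non-negative kernel feature map (a common choice for linear attention).
        q, k = F.relu(q), F.relu(k)
        # Linear attention: compute K^T V first, avoiding the n x n matrix.
        kv = torch.einsum('bhnd,bhne->bhde', k, v)
        out = torch.einsum('bhnd,bhde->bhne', q, kv)
        # Normalize the output instead of scaling by attention row sums.
        out = self.norm(out)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

For the attention-dilution issue, early layers would additionally restrict the query-key interaction to a (block-)diagonal neighbourhood of each token; the sketch above only covers the normalization part.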