Transformer has shown great successes in natural language processing, computer vision, and audio processing. As one of its core components, softmax attention helps to capture long-range dependencies, yet prohibits scaling up due to its quadratic space and time complexity with respect to the sequence length. Kernel methods are often adopted to reduce the complexity by approximating the softmax operator. Nevertheless, due to approximation errors, their performance varies across tasks and corpora and suffers significant drops compared with vanilla softmax attention. In this paper, we propose a linear transformer called cosFormer that achieves comparable or better accuracy than the vanilla transformer in both causal and cross attention. cosFormer is based on two key properties of softmax attention: (i) the non-negativity of the attention matrix; (ii) a non-linear re-weighting scheme that concentrates the distribution of the attention matrix. As a linear substitute, cosFormer fulfills these properties with a linear operator and a cosine-based distance re-weighting mechanism. Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark. The source code is available at https://github.com/OpenNLPLab/cosFormer.
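To make the mechanism concrete, the following is a minimal, non-causal PyTorch sketch of linear attention with a non-negative (ReLU) feature map and the cosine-based distance re-weighting described above. The function name, tensor shapes, and the choice of the length scale M (set to the sequence length here) are illustrative assumptions, not the authors' exact implementation; see the repository above for that.

```python
import math

import torch
import torch.nn.functional as F


def cosformer_attention(q, k, v, eps=1e-6):
    """Sketch of non-causal linear attention with cosine re-weighting.

    q, k: (batch, n, d); v: (batch, n, e). Cost is O(n * d * e), i.e.
    linear in the sequence length n, instead of O(n^2) for softmax attention.
    """
    b, n, d = q.shape
    m = n  # illustrative choice of M, the re-weighting length scale

    # Property (i): non-negative feature map (ReLU here).
    q, k = F.relu(q), F.relu(k)

    # Property (ii): re-weight entry (i, j) by cos(pi/2 * (i - j) / M).
    # cos(a - b) = cos(a)cos(b) + sin(a)sin(b) splits the weight into
    # per-row and per-column factors, which preserves linear complexity.
    idx = torch.arange(1, n + 1, device=q.device, dtype=q.dtype)
    cos_w = torch.cos(math.pi / 2 * idx / m)[None, :, None]  # (1, n, 1)
    sin_w = torch.sin(math.pi / 2 * idx / m)[None, :, None]
    q_cos, q_sin = q * cos_w, q * sin_w
    k_cos, k_sin = k * cos_w, k * sin_w

    # Associativity: contract K with V first (d x e matrices per batch).
    kv_cos = torch.einsum('bnd,bne->bde', k_cos, v)
    kv_sin = torch.einsum('bnd,bne->bde', k_sin, v)
    num = (torch.einsum('bnd,bde->bne', q_cos, kv_cos)
           + torch.einsum('bnd,bde->bne', q_sin, kv_sin))

    # Row-wise normalization (the denominator of the attention weights).
    den = (torch.einsum('bnd,bd->bn', q_cos, k_cos.sum(dim=1))
           + torch.einsum('bnd,bd->bn', q_sin, k_sin.sum(dim=1)))
    return num / den.clamp(min=eps).unsqueeze(-1)
```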