Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism, which encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention in the input sequence length has limited its application to longer sequences -- a topic being actively studied in the community. To address this limitation, we propose Nystr\"omformer -- a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nystr\"om method to approximate standard self-attention with $O(n)$ complexity. The scalability of Nystr\"omformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nystr\"omformer performs comparably to, and in a few cases even slightly better than, the standard Transformer. Our code is available at https://github.com/mlpen/Nystromformer.
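To make the $O(n)$ construction concrete, the following is a minimal sketch of Nystr\"om-approximated softmax attention in PyTorch. It assumes segment-mean landmarks and an exact Moore-Penrose pseudoinverse for readability (the released code instead approximates the pseudoinverse iteratively); the function name \texttt{nystrom\_attention}, the \texttt{num\_landmarks} parameter, and the requirement that the sequence length divide evenly into landmark segments are illustrative choices, not the authors' exact implementation.

\begin{verbatim}
import torch

def nystrom_attention(Q, K, V, num_landmarks=64):
    """Sketch of Nystrom-approximated softmax attention.

    Q, K, V: (batch, n, d) tensors with num_landmarks (m) << n.
    Landmarks are segment means of Q and K; assumes n % m == 0.
    """
    b, n, d = Q.shape
    m = num_landmarks
    scale = d ** -0.5

    # Landmark queries/keys: average over n // m contiguous segments.
    Q_tilde = Q.reshape(b, m, n // m, d).mean(dim=2)   # (b, m, d)
    K_tilde = K.reshape(b, m, n // m, d).mean(dim=2)   # (b, m, d)

    # Three small softmax kernels replace the full n x n attention map.
    F1 = torch.softmax(Q @ K_tilde.transpose(-1, -2) * scale, dim=-1)        # (b, n, m)
    A  = torch.softmax(Q_tilde @ K_tilde.transpose(-1, -2) * scale, dim=-1)  # (b, m, m)
    F2 = torch.softmax(Q_tilde @ K.transpose(-1, -2) * scale, dim=-1)        # (b, m, n)

    # softmax(QK^T / sqrt(d)) V  ~=  F1 @ pinv(A) @ (F2 @ V),
    # costing O(n) time and memory for a fixed number of landmarks m.
    return F1 @ torch.linalg.pinv(A) @ (F2 @ V)

# Toy usage: sequence length 1024, 64 landmarks.
Q = torch.randn(2, 1024, 64)
K = torch.randn(2, 1024, 64)
V = torch.randn(2, 1024, 64)
out = nystrom_attention(Q, K, V, num_landmarks=64)
print(out.shape)  # torch.Size([2, 1024, 64])
\end{verbatim}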