Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, they cannot be used directly with popular pre-trained language models that were trained with vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants such as Performer and RFA; (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training; and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of the top-$k$ approximation for multi-head attention layers on the Long Range Arena benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
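To make the chunked top-$k$ scheme described above concrete, the following is a minimal NumPy sketch, not the authors' implementation: queries are processed `chunk_size` at a time, and within each chunk only the $k$ largest query-key scores per query survive the softmax. The function name, `chunk_size` parameter, and tie handling are illustrative assumptions.

```python
import numpy as np

def topk_attention(Q, K, V, k, chunk_size=64):
    """Illustrative sketch: approximate softmax attention by keeping,
    for each query, only its k largest query-key scores. Queries are
    processed in chunks so only a (chunk_size x n_keys) score matrix
    is materialized at any time."""
    n_q, d = Q.shape
    out = np.empty((n_q, V.shape[1]))
    for start in range(0, n_q, chunk_size):
        q = Q[start:start + chunk_size]            # (c, d)
        scores = q @ K.T / np.sqrt(d)              # (c, n_keys)
        # k-th largest score in each row serves as the cutoff
        thresh = np.partition(scores, -k, axis=-1)[:, -k, None]
        masked = np.where(scores >= thresh, scores, -np.inf)
        # softmax over the surviving (top-k) scores only
        weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + chunk_size] = weights @ V
    return out
```

With a fixed `chunk_size`, peak memory grows linearly with the number of keys rather than quadratically with the sequence length, which is the property the abstract refers to.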