Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, they cannot be used directly with popular pre-trained language models that were trained with vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants such as Performer and RFA; (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training; and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of the top-$k$ approximation for multi-head attention layers on the Long Range Arena benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
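To make the chunked top-$k$ scheme described above concrete, the following is a minimal NumPy sketch, not the authors' implementation: queries are processed `chunk_size` at a time, and within each chunk only the $k$ largest query-key scores per query survive the softmax. The function name, `chunk_size` parameter, and tie handling are illustrative assumptions.

```python
import numpy as np

def topk_attention(Q, K, V, k, chunk_size=64):
    """Illustrative sketch: approximate softmax attention by keeping,
    for each query, only its k largest query-key scores. Queries are
    processed in chunks so only a (chunk_size x n_keys) score matrix
    is materialized at any time."""
    n_q, d = Q.shape
    out = np.empty((n_q, V.shape[1]))
    for start in range(0, n_q, chunk_size):
        q = Q[start:start + chunk_size]            # (c, d)
        scores = q @ K.T / np.sqrt(d)              # (c, n_keys)
        # k-th largest score in each row serves as the cutoff
        thresh = np.partition(scores, -k, axis=-1)[:, -k, None]
        masked = np.where(scores >= thresh, scores, -np.inf)
        # softmax over the surviving (top-k) scores only
        weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + chunk_size] = weights @ V
    return out
```

With a fixed `chunk_size`, peak memory grows linearly with the number of keys rather than quadratically with the sequence length, which is the property the abstract refers to.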