Pairwise dot-product attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision. However, a typical Transformer model computes such pairwise attention scores repeatedly for the same sequence, in multiple heads and multiple layers. We systematically analyze the empirical similarity of these scores across heads and layers and find them to be considerably redundant, with adjacent layers in particular showing high similarity. Motivated by these findings, we propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers. Experiments on a number of standard benchmarks show that reusing attention delivers performance equivalent to or better than standard Transformers, while reducing both compute and memory usage.
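To make the idea concrete, below is a minimal sketch, not the authors' implementation, of how attention reuse could look in PyTorch: one layer computes multi-head attention probabilities as usual and exposes them, and a subsequent "reuse" layer applies those cached probabilities to its own value projections, skipping the QK^T and softmax computation. All class and parameter names here are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class ComputeAttention(nn.Module):
    """Standard multi-head self-attention that also returns its probabilities."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))             # (batch, heads, seq, d_head)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        probs = scores.softmax(dim=-1)             # (batch, heads, seq, seq)
        ctx = (probs @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx), probs                # expose probs for reuse


class ReuseAttention(nn.Module):
    """Attention layer that reuses probabilities computed by an earlier layer."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.v_proj = nn.Linear(d_model, d_model)  # only values are projected
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, probs):                   # probs from a previous layer
        b, t, _ = x.shape
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        ctx = (probs @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx)


# Usage: one compute layer followed by a layer that reuses its attention map.
x = torch.randn(2, 16, 64)
layer1 = ComputeAttention(d_model=64, n_heads=4)
layer2 = ReuseAttention(d_model=64, n_heads=4)
y1, probs = layer1(x)
y2 = layer2(y1, probs)                             # no new QK^T or softmax here
```

The reuse layer saves the quadratic score computation and the storage of a fresh attention matrix; how many consecutive layers share one set of scores is a design choice left open in this sketch.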