Pairwise dot-product attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision. However, a typical Transformer model computes such pairwise attention scores repeatedly for the same sequence, in multiple heads and multiple layers. We systematically analyze the empirical similarity of these scores across heads and layers and find them to be considerably redundant, with adjacent layers in particular showing high similarity. Motivated by these findings, we propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers. Experiments on a number of standard benchmarks show that reusing attention delivers performance equivalent to or better than standard Transformers, while reducing both compute and memory usage.
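To make the idea concrete, below is a minimal sketch, not the authors' implementation, of how attention reuse could look in PyTorch: one layer computes multi-head attention probabilities as usual and exposes them, and a subsequent "reuse" layer applies those cached probabilities to its own value projections, skipping the QK^T and softmax computation. All class and parameter names here are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class ComputeAttention(nn.Module):
    """Standard multi-head self-attention that also returns its probabilities."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))             # (batch, heads, seq, d_head)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        probs = scores.softmax(dim=-1)             # (batch, heads, seq, seq)
        ctx = (probs @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx), probs                # expose probs for reuse


class ReuseAttention(nn.Module):
    """Attention layer that reuses probabilities computed by an earlier layer."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.v_proj = nn.Linear(d_model, d_model)  # only values are projected
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, probs):                   # probs from a previous layer
        b, t, _ = x.shape
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        ctx = (probs @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx)


# Usage: one compute layer followed by a layer that reuses its attention map.
x = torch.randn(2, 16, 64)
layer1 = ComputeAttention(d_model=64, n_heads=4)
layer2 = ReuseAttention(d_model=64, n_heads=4)
y1, probs = layer1(x)
y2 = layer2(y1, probs)                             # no new QK^T or softmax here
```

The reuse layer saves the quadratic score computation and the storage of a fresh attention matrix; how many consecutive layers share one set of scores is a design choice left open in this sketch.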