We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run-time and memory complexity. We perform extensive experiments showing that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical of natural language and vision tasks. Our method outperforms alternative sub-quadratic proposals by more than +6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.
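To make the H-Matrix analogy concrete, the sketch below is a simplified two-level illustration (not the paper's exact algorithm): attention within a token's own block is computed exactly, while all other blocks are summarized by block-averaged keys and values, mirroring the fine near-diagonal / coarse off-diagonal structure of an H-Matrix. The block size and the single coarse level are assumptions made here for clarity; the full method stacks multiple hierarchy levels to reach linear time and memory.

```python
# Illustrative sketch only, assuming a two-level hierarchy: exact attention inside
# each diagonal block, plus a rank-1 (block-averaged) summary of every other block.
import numpy as np

def hierarchical_attention(q, k, v, block=16):
    """q, k, v: (seq_len, d); seq_len must be a multiple of `block`."""
    n, d = q.shape
    nb = n // block
    scale = 1.0 / np.sqrt(d)

    # Reshape into blocks: (nb, block, d).
    qb = q.reshape(nb, block, d)
    kb = k.reshape(nb, block, d)
    vb = v.reshape(nb, block, d)

    # Fine level: exact attention logits within each diagonal block.
    local_scores = np.einsum('bqd,bkd->bqk', qb, kb) * scale          # (nb, block, block)

    # Coarse level: each query attends to block-averaged keys/values,
    # a low-rank stand-in for the off-diagonal blocks of the attention matrix.
    k_coarse = kb.mean(axis=1)                                        # (nb, d)
    v_coarse = vb.mean(axis=1)                                        # (nb, d)
    coarse_scores = np.einsum('bqd,cd->bqc', qb, k_coarse) * scale    # (nb, block, nb)
    # Mask each query's own block at the coarse level (it is handled exactly above).
    own = np.arange(nb)
    coarse_scores[own, :, own] = -np.inf

    # Joint softmax over local + coarse logits so the weights are normalized together.
    logits = np.concatenate([local_scores, coarse_scores], axis=-1)   # (nb, block, block+nb)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    w_local, w_coarse = weights[..., :block], weights[..., block:]
    out = (np.einsum('bqk,bkd->bqd', w_local, vb)
           + np.einsum('bqc,cd->bqd', w_coarse, v_coarse))
    return out.reshape(n, d)

# Example: 256 tokens, 64-dim heads.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
print(hierarchical_attention(q, k, v).shape)  # (256, 64)
```

With blocks of size b, the cost is O(n·b) for the fine level and O(n·n/b) for the single coarse level; recursively coarsening the far-field blocks across several levels, as in the hierarchical scheme described above, is what brings the overall cost down to linear.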