Standard inference and training with transformer-based architectures scale quadratically with input sequence length. This is prohibitively expensive for a variety of applications, especially web-page translation, query answering, etc. Consequently, several approaches have been developed recently to speed up attention computation by enforcing different attention structures such as sparsity and low rank, or by approximating attention using kernels. In this work, we view attention computation as a nearest-neighbor retrieval problem and use decision-tree-based hierarchical navigation to reduce the retrieval cost per query token from linear in sequence length to nearly logarithmic. Based on such hierarchical navigation, we design Treeformer, which can use one of two efficient attention layers -- TF-Attention and TC-Attention. TF-Attention computes the attention in a fine-grained style, while TC-Attention is a coarse attention layer that also ensures the gradients are "dense". To optimize such challenging discrete layers, we propose a two-level bootstrapped training method. Using extensive experiments on standard NLP benchmarks, especially for long sequences, we demonstrate that our Treeformer architecture can be almost as accurate as the baseline Transformer while using 30x fewer FLOPs in the attention layer. Compared to Linformer, the accuracy can be as much as 12% higher while using similar FLOPs in the attention layer.
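To make the retrieval view of attention concrete, the following NumPy sketch restricts each query's attention to the keys found by walking a binary decision tree built over the keys. This is only an illustration under assumed choices (random hyperplane splits at the median, a fixed leaf size, and the helper names build_tree, route, tree_attention), not the Treeformer implementation; it shows how balanced tree routing brings the per-query retrieval cost down to roughly O(log n) from the O(n) of dense attention.

```python
import numpy as np

def build_tree(keys, idx, depth, max_leaf=8, rng=None):
    """Recursively split key indices with a random hyperplane at the median projection."""
    if len(idx) <= max_leaf or depth == 0:
        return {"leaf": idx}
    w = rng.standard_normal(keys.shape[1])      # split direction (assumed random here)
    proj = keys[idx] @ w
    t = np.median(proj)
    left, right = idx[proj <= t], idx[proj > t]
    if len(left) == 0 or len(right) == 0:       # degenerate split: stop early
        return {"leaf": idx}
    return {"w": w, "t": t,
            "left": build_tree(keys, left, depth - 1, max_leaf, rng),
            "right": build_tree(keys, right, depth - 1, max_leaf, rng)}

def route(node, q):
    """Walk the tree with a query vector; return the key indices stored in its leaf."""
    while "leaf" not in node:
        node = node["left"] if q @ node["w"] <= node["t"] else node["right"]
    return node["leaf"]

def tree_attention(Q, K, V, depth=7, max_leaf=8, seed=0):
    """Softmax attention computed only over the keys retrieved for each query."""
    rng = np.random.default_rng(seed)
    tree = build_tree(K, np.arange(len(K)), depth, max_leaf, rng)
    out = np.zeros((len(Q), V.shape[1]))
    for i, q in enumerate(Q):
        idx = route(tree, q)                    # ~log(n) retrieval instead of scanning all keys
        scores = K[idx] @ q / np.sqrt(K.shape[1])
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ V[idx]
    return out

# Toy usage: 1024 tokens, 64-dim heads.
n, d = 1024, 64
rng = np.random.default_rng(1)
Q, K, V = rng.standard_normal((3, n, d))
print(tree_attention(Q, K, V).shape)            # (1024, 64)
```

In this sketch each query attends to at most max_leaf keys, so the attention cost per query is bounded by the leaf size plus the depth of the tree; the actual Treeformer layers (TF-Attention, TC-Attention) and the bootstrapped training procedure are described in the paper.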