Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x what was previously possible using similar hardware. As a consequence of this capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
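To make the structure of the sparse attention concrete, the following is a minimal sketch (not the authors' implementation) of the kind of attention mask described above: a sliding local window, a small fixed number of global tokens such as CLS that attend to and are attended by every position, and a few random connections per token. All function names and parameter values here are illustrative assumptions; the point is that the number of allowed attention entries grows linearly in the sequence length rather than quadratically.

```python
import numpy as np

def sparse_attention_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Boolean mask: mask[i, j] is True if token i may attend to token j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Sliding-window attention: each token attends to its local neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global tokens (e.g. CLS): attend to all tokens and are attended by all.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each token additionally attends to a few random positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

if __name__ == "__main__":
    m = sparse_attention_mask(16)
    # Allowed entries scale as O(seq_len), versus seq_len**2 for full attention.
    print(m.sum(), "of", m.size, "entries are attended")
```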