Transformer-based models, such as BERT, have been among the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x what was previously possible using similar hardware. As a consequence of this capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
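To make the structure of the sparse attention concrete, the following is a minimal sketch (not the authors' implementation) of the kind of attention mask described above: a sliding local window, a small fixed number of global tokens such as CLS that attend to and are attended by every position, and a few random connections per token. All function names and parameter values here are illustrative assumptions; the point is that the number of allowed attention entries grows linearly in the sequence length rather than quadratically.

```python
import numpy as np

def sparse_attention_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Boolean mask: mask[i, j] is True if token i may attend to token j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Sliding-window attention: each token attends to its local neighbours.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global tokens (e.g. CLS): attend to all tokens and are attended by all.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each token additionally attends to a few random positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

if __name__ == "__main__":
    m = sparse_attention_mask(16)
    # Allowed entries scale as O(seq_len), versus seq_len**2 for full attention.
    print(m.sum(), "of", m.size, "entries are attended")
```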