Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. Finally, we introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
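To make the attention pattern concrete, the sketch below builds a boolean mask that combines a local sliding window with full attention at a few designated "global" positions. This is an illustrative toy in PyTorch, not the paper's implementation: the function name `longformer_attention_mask` and the choice of global positions are assumptions for the example, and an efficient implementation would compute the banded attention directly rather than materializing a dense n×n mask.

```python
import torch

def longformer_attention_mask(seq_len, window, global_idx):
    """Illustrative mask: local sliding-window attention plus a few
    globally attending positions. Not the paper's actual kernel; a real
    implementation avoids building the full n x n matrix."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (n, 1)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions, shape (1, n)

    # Local attention: each token attends to neighbors within +/- window//2.
    local = (i - j).abs() <= window // 2

    # Global attention: chosen tokens (e.g. a [CLS]-like token) attend to
    # all positions and are attended to by all positions (symmetric).
    glob = torch.zeros(seq_len, dtype=torch.bool)
    glob[list(global_idx)] = True
    mask = local | glob.unsqueeze(0) | glob.unsqueeze(1)
    return mask  # True where attention is allowed

# Example: 16 tokens, window of 4, position 0 treated as global.
mask = longformer_attention_mask(16, 4, global_idx=[0])
# Allowed pairs grow roughly as window * n plus O(n) for the global tokens,
# i.e. linearly in sequence length, rather than as n * n.
print(mask.sum().item(), "of", 16 * 16, "pairs attended")
```

The point of the sketch is the cost argument in the abstract: restricting most tokens to a fixed-size window keeps the number of attended pairs linear in sequence length, while the handful of task-motivated global positions preserve long-range connectivity.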