Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms. Further analysis on local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results -- using disjoint local attentions, we are able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer~\citep{longformer} with half of its pretraining compute.
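To make the key architectural distinction concrete, below is a minimal sketch of disjoint (non-overlapping, block-diagonal) local attention, as opposed to the overlapping sliding-window attention used in Longformer. The function name, block size, and tensor shapes are illustrative assumptions and do not reproduce the paper's actual implementation; the real model interleaves such layers inside a full Transformer.

```python
# Minimal sketch of disjoint block-local self-attention: the sequence is split
# into non-overlapping blocks and full attention is computed only within each
# block, so per-query cost is O(block_size) instead of O(seq_len).
# block_size and tensor shapes below are illustrative, not the paper's settings.
import torch
import torch.nn.functional as F

def disjoint_local_attention(q, k, v, block_size=64):
    """q, k, v: (batch, seq_len, dim); seq_len must be a multiple of block_size."""
    b, n, d = q.shape
    nb = n // block_size
    # Reshape so each block attends only to itself (no window overlap).
    q = q.view(b, nb, block_size, d)
    k = k.view(b, nb, block_size, d)
    v = v.view(b, nb, block_size, d)
    scores = torch.einsum("bnqd,bnkd->bnqk", q, k) / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    out = torch.einsum("bnqk,bnkd->bnqd", probs, v)
    return out.reshape(b, n, d)

# Usage: a 1024-token sequence with 64-dimensional (single-head) projections.
x = torch.randn(2, 1024, 64)
y = disjoint_local_attention(x, x, x, block_size=64)
print(y.shape)  # torch.Size([2, 1024, 64])
```

Because the blocks are disjoint, this variant avoids the duplicated key/value computation that overlapping windows require, which is the source of the efficiency gain referred to in the abstract.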