Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms. Further analysis on local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results -- using disjoint local attentions, we are able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer~\citep{longformer} with half of its pretraining compute. The code to replicate our experiments can be found at \url{https://github.com/pytorch/fairseq/tree/main/examples/xformers}.
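To make the disjoint-attention result concrete, below is a minimal PyTorch sketch of block-local attention, in which tokens attend only within non-overlapping blocks (in contrast to the overlapping sliding windows of Longformer~\citep{longformer}). The function name, tensor shapes, and block size are illustrative assumptions for this sketch, not the released implementation; see the linked fairseq examples for the latter.
\begin{verbatim}
# Sketch of disjoint (non-overlapping) block-local attention.
# Shapes and block_size are illustrative assumptions.
import math
import torch

def disjoint_local_attention(q, k, v, block_size=64):
    """q, k, v: (batch, seq_len, dim); seq_len must be a
    multiple of block_size (pad beforehand if it is not)."""
    bsz, seq_len, dim = q.shape
    assert seq_len % block_size == 0
    n_blocks = seq_len // block_size

    # Split into non-overlapping blocks:
    # (batch, n_blocks, block_size, dim).
    q = q.view(bsz, n_blocks, block_size, dim)
    k = k.view(bsz, n_blocks, block_size, dim)
    v = v.view(bsz, n_blocks, block_size, dim)

    # Attention is computed independently inside each block,
    # so cost is O(seq_len * block_size), not O(seq_len^2).
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(dim)
    out = torch.matmul(scores.softmax(dim=-1), v)
    return out.view(bsz, seq_len, dim)

# Example: a 1024-token sequence attends in sixteen 64-token blocks.
x = torch.randn(2, 1024, 128)
y = disjoint_local_attention(x, x, x, block_size=64)
print(y.shape)  # torch.Size([2, 1024, 128])
\end{verbatim}
Because the blocks are disjoint, this sketch avoids the duplicated key/value computation that overlapping windows incur, which is the source of the efficiency gain described above.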