Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms. Further analysis on local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results -- using disjoint local attentions, we are able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer~\citep{longformer} with half of its pretraining compute.
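To make the key architectural distinction concrete, below is a minimal sketch of disjoint (non-overlapping, block-diagonal) local attention, as opposed to the overlapping sliding-window attention used in Longformer. The function name, block size, and tensor shapes are illustrative assumptions and do not reproduce the paper's actual implementation; the real model interleaves such layers inside a full Transformer.

```python
# Minimal sketch of disjoint block-local self-attention: the sequence is split
# into non-overlapping blocks and full attention is computed only within each
# block, so per-query cost is O(block_size) instead of O(seq_len).
# block_size and tensor shapes below are illustrative, not the paper's settings.
import torch
import torch.nn.functional as F

def disjoint_local_attention(q, k, v, block_size=64):
    """q, k, v: (batch, seq_len, dim); seq_len must be a multiple of block_size."""
    b, n, d = q.shape
    nb = n // block_size
    # Reshape so each block attends only to itself (no window overlap).
    q = q.view(b, nb, block_size, d)
    k = k.view(b, nb, block_size, d)
    v = v.view(b, nb, block_size, d)
    scores = torch.einsum("bnqd,bnkd->bnqk", q, k) / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    out = torch.einsum("bnqk,bnkd->bnqd", probs, v)
    return out.reshape(b, n, d)

# Usage: a 1024-token sequence with 64-dimensional (single-head) projections.
x = torch.randn(2, 1024, 64)
y = disjoint_local_attention(x, x, x, block_size=64)
print(y.shape)  # torch.Size([2, 1024, 64])
```

Because the blocks are disjoint, this variant avoids the duplicated key/value computation that overlapping windows require, which is the source of the efficiency gain referred to in the abstract.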