Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants that seek to reduce computational complexity, such as Longformer and Performer. While such models have theoretically greater efficiency, their effectiveness on real NLP tasks has not been well studied. We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the effects of pretraining and hyperparameter settings, so as to focus on their capacity for long-range attention. Moreover, we present various methods to investigate attention behaviors, illuminating model details beyond metric scores. We find that the attention of long-range Transformers has advantages in content selection and query-guided decoding, but it comes with previously unrecognized drawbacks such as insufficient attention to distant tokens.
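As a rough illustration of the quadratic cost mentioned above (a minimal sketch, not part of the paper's experiments), the NumPy snippet below contrasts the full N x N attention-score matrix with a Longformer-style sliding-window pattern; the sizes, variable names, and window radius are illustrative assumptions.

```python
import numpy as np

# Toy sizes: sequence length N, head dimension d, local window radius w (illustrative).
N, d, w = 8, 16, 2
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))

# Full self-attention materializes an N x N score matrix: O(N^2) time and memory.
full_scores = Q @ K.T / np.sqrt(d)

# A Longformer-style sliding window keeps only scores with |i - j| <= w,
# i.e. roughly N * (2w + 1) entries instead of N^2.
window_mask = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :]) <= w
local_scores = np.where(window_mask, full_scores, -np.inf)

print(full_scores.size, int(window_mask.sum()))  # 64 full scores vs. 34 local ones
```

The gap between the two counts grows linearly in N for the windowed variant but quadratically for full attention, which is the efficiency trade-off the benchmarked models exploit.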