Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.
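The biasing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single `slope` parameter, and the causal (left-to-right) masking are assumptions made for illustration. The abstract states only that the penalty is proportional to the query-key distance. The sketch takes precomputed query-key scores, subtracts `slope * (i - j)` from the score of query `i` against key `j`, and applies a softmax.

```python
import math


def alibi_bias(seq_len, slope):
    """Return a seq_len x seq_len bias matrix.

    Entry [i][j] is -slope * (i - j) for keys at or before the query
    (j <= i), and -inf for future keys (assumed causal masking).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, slope):
    """Add the linear distance penalty to raw scores, then normalize.

    `scores` is a square list of lists holding query-key dot products;
    `slope` is a hypothetical fixed penalty rate for a single head.
    """
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [
        softmax([s + b for s, b in zip(scores[i], bias[i])])
        for i in range(n)
    ]


# With identical raw scores, the penalty alone decides the weights,
# so nearer keys receive more attention than distant ones.
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = biased_attention_weights(uniform_scores, slope=0.5)
last_row = weights[-1]
```

Because the bias depends only on the offset `i - j` and not on any learned position table, the same function applies unchanged to sequences longer than those used in training, which is the property the abstract relies on for extrapolation.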