Transformers struggle when attending to long contexts, since the amount of computation grows quadratically with the context length, and therefore they cannot model long-term memories effectively. Several variations have been proposed to alleviate this problem, but they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length. It is thus able to model arbitrarily long contexts and maintain "sticky memories" while keeping a fixed computation budget. Experiments on a synthetic sorting task demonstrate the ability of the $\infty$-former to retain information from long sequences. We also perform experiments on language modeling, by training a model from scratch and by fine-tuning a pre-trained language model, which show the benefits of unbounded long-term memories.
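The core idea can be sketched in a few lines: the unbounded memory is compressed into a fixed number of basis coefficients, and attention over it is a continuous probability density rather than a softmax over discrete positions. The sketch below is illustrative only, under assumed choices (Gaussian RBF basis, a ridge fit for the coefficients, a Gaussian attention density, numerical quadrature); it is not the paper's exact parameterization, but it shows why the attention cost depends on the number of basis functions rather than on the memory length `L`.

```python
import numpy as np

def rbf_basis(t, num_basis):
    # Gaussian radial basis functions with centers evenly spread over [0, 1].
    centers = np.linspace(0.0, 1.0, num_basis)
    width = 1.0 / num_basis
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

def compress_memory(X, num_basis=32, ridge=1e-3):
    # Fit coefficients B so that rbf_basis(t) @ B approximates the (L, d)
    # memory X as a continuous signal x(t) over t in [0, 1].  After this
    # one-off fit, the memory occupies (num_basis, d) regardless of L.
    L, _ = X.shape
    t = np.linspace(0.0, 1.0, L)
    Psi = rbf_basis(t, num_basis)                       # (L, num_basis)
    B = np.linalg.solve(Psi.T @ Psi + ridge * np.eye(num_basis), Psi.T @ X)
    return B                                            # (num_basis, d)

def continuous_attention(B, mu, sigma, num_points=200):
    # Attend with a Gaussian density p(t) ~ N(mu, sigma^2) over [0, 1].
    # Cost depends on num_basis and num_points, never on the original L.
    t = np.linspace(0.0, 1.0, num_points)
    p = np.exp(-((t - mu) ** 2) / (2 * sigma ** 2))
    p /= p.sum()                                        # quadrature weights
    Psi = rbf_basis(t, B.shape[0])                      # reconstruct x(t) on grid
    return p @ (Psi @ B)                                # context vector, shape (d,)
```

Because attending only touches the `num_basis` coefficients, a memory of 500 steps and one of 500,000 steps cost the same to query; the compression step is what trades memory length for resolution.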