Transformer-based models typically have a predefined bound on their input length, because of their need to potentially attend to every token in the input. In this work, we propose Unlimiformer: a general approach that can wrap any existing pretrained encoder-decoder transformer and offload the attention computation across all layers to a single $k$-nearest-neighbor index; this index can be kept in either GPU or CPU memory and queried in sub-linear time. This way, we can index extremely long input sequences, while every attention head in every decoder layer retrieves its top-$k$ keys instead of attending to every key. We demonstrate Unlimiformer's efficacy on several long-document and multi-document summarization benchmarks, showing that it can summarize even 350k-token-long inputs from the BookSum dataset, without any input truncation at test time. Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer.
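To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of retrieval-based cross-attention: each decoder query attends only to its top-$k$ retrieved encoder keys rather than to all of them. Here `torch.topk` over exact dot products stands in for the $k$-nearest-neighbor index described in the abstract, and the function and variable names (`knn_cross_attention`, `enc_keys`, etc.) are hypothetical.

```python
# Sketch: cross-attention restricted to the top-k keys per query.
# torch.topk emulates the k-NN index lookup; the actual system maintains a
# single index shared across all decoder layers and attention heads.
import torch
import torch.nn.functional as F

def knn_cross_attention(queries, keys, values, k=16):
    """
    queries: (num_queries, d)  decoder attention queries (one head)
    keys:    (num_tokens, d)   encoder keys for the full, un-truncated input
    values:  (num_tokens, d)   encoder values aligned with `keys`
    """
    d = queries.size(-1)
    scores = queries @ keys.T / d ** 0.5            # (num_queries, num_tokens)
    top_scores, top_idx = scores.topk(k, dim=-1)    # keep only the k best keys per query
    probs = F.softmax(top_scores, dim=-1)           # softmax over the retrieved subset
    top_values = values[top_idx]                    # (num_queries, k, d)
    return torch.einsum("qk,qkd->qd", probs, top_values)

# Toy usage: a 100k-token "input" far beyond a typical attention window.
torch.manual_seed(0)
enc_keys = torch.randn(100_000, 64)
enc_vals = torch.randn(100_000, 64)
dec_queries = torch.randn(8, 64)
out = knn_cross_attention(dec_queries, enc_keys, enc_vals, k=16)
print(out.shape)  # torch.Size([8, 64])
```

In the sketch the score matrix is still computed exactly, so it only illustrates the attention-over-retrieved-keys step; the sub-linear query time claimed in the abstract comes from replacing that exact search with an approximate nearest-neighbor index.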