Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles, and long documents due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder, and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
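To make the chunk-and-fuse idea concrete, the following is a minimal sketch (not the authors' implementation) of sliding-window encoding with fusion-in-decoder, assuming an off-the-shelf Hugging Face BART checkpoint; the chunk length and stride values are illustrative.

```python
# Sketch of the SLED idea: split a long input into overlapping chunks,
# encode each chunk independently with a short-text encoder, then
# concatenate the encoder states so the pretrained decoder can attend
# over all chunks at once (fusion-in-decoder).
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def chunk_with_overlap(ids, chunk_len=256, stride=192):
    # stride < chunk_len produces overlapping windows; the tail is kept.
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(ids[start:start + chunk_len])
        if start + chunk_len >= len(ids):
            break
    return chunks

def sled_generate(long_text, max_new_tokens=64):
    ids = tokenizer(long_text, add_special_tokens=False)["input_ids"]
    chunks = [torch.tensor([c]) for c in chunk_with_overlap(ids)]
    # Encode each chunk independently with the short-text encoder.
    states = [model.get_encoder()(input_ids=c).last_hidden_state for c in chunks]
    # Concatenate along the sequence axis so the decoder's cross-attention
    # fuses information across chunks during generation.
    fused = torch.cat(states, dim=1)
    mask = torch.ones(fused.shape[:2], dtype=torch.long)
    return model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=fused),
        attention_mask=mask,
        max_new_tokens=max_new_tokens,
    )
```

Decoding the returned ids with `tokenizer.batch_decode(...)` yields the generated text; the overlap between adjacent windows gives boundary tokens context from both sides, while the decoder sees the full concatenated sequence of encoder states.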