Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
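To make the chunk-and-fuse idea concrete, below is a minimal sketch (not the authors' implementation) of how a short-text encoder-decoder LM could be applied to a long input: encode overlapping chunks independently with the pretrained encoder, concatenate the chunk representations along the sequence axis, and let the pretrained decoder cross-attend over all of them, fusion-in-decoder style. The checkpoint name, chunk length, and stride are illustrative assumptions, and the actual SLED method additionally handles chunk overlap and prepends a task prefix to each chunk.

```python
# A simplified sketch of the SLED idea, assuming a Hugging Face BART checkpoint.
# Hypothetical chunk sizes; not the authors' code.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers.modeling_outputs import BaseModelOutput

model_name = "facebook/bart-base"      # any short-text encoder-decoder LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

chunk_len, stride = 512, 256           # overlapping windows (assumed values)

def sled_generate(long_text: str, max_new_tokens: int = 128) -> str:
    ids = tokenizer(long_text, return_tensors="pt", truncation=False).input_ids[0]
    # Partition the long input into overlapping chunks.
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), stride)]
    encoder = model.get_encoder()
    hidden_states, masks = [], []
    with torch.no_grad():
        for chunk in chunks:
            chunk = chunk.unsqueeze(0)                   # (1, <=chunk_len)
            mask = torch.ones_like(chunk)
            out = encoder(input_ids=chunk, attention_mask=mask)
            hidden_states.append(out.last_hidden_state)  # (1, <=chunk_len, d)
            masks.append(mask)
    # Concatenate chunk encodings so the decoder can attend to the whole document.
    fused = BaseModelOutput(last_hidden_state=torch.cat(hidden_states, dim=1))
    fused_mask = torch.cat(masks, dim=1)
    out_ids = model.generate(encoder_outputs=fused,
                             attention_mask=fused_mask,
                             max_new_tokens=max_new_tokens)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```

Because each chunk is encoded independently, the quadratic attention cost applies only within a chunk, while the decoder's cross-attention over the concatenated states grows linearly with the number of chunks.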