While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.
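To make the architectural idea concrete, below is a minimal sketch of what a block-local attention mask with global encoder tokens and staggered block boundaries could look like. The function name, block size, global-token count, and half-block staggering scheme are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def block_local_global_mask(seq_len, block_size, num_global, stagger=False):
    """Illustrative attention mask (hypothetical helper, not PEGASUS-X code):
    local tokens attend only within their own block, while a small set of
    global tokens attends to, and is attended by, every position. Setting
    stagger=True shifts block boundaries by half a block, as alternating
    layers might do in a staggered block-local design."""
    total = num_global + seq_len
    mask = np.zeros((total, total), dtype=bool)

    # Global tokens: full attention in both directions.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Local tokens: attention restricted to their (possibly shifted) block.
    offset = block_size // 2 if stagger else 0
    for i in range(seq_len):
        block_id = (i + offset) // block_size
        start = max(block_id * block_size - offset, 0)
        end = min((block_id + 1) * block_size - offset, seq_len)
        mask[num_global + i, num_global + start:num_global + end] = True
    return mask

# Example: 16 input tokens, blocks of 4, 2 global tokens, staggered boundaries.
print(block_local_global_mask(16, 4, 2, stagger=True).astype(int))
```

Because each local token only attends within a fixed-size block (plus the global tokens), the cost of encoder self-attention grows roughly linearly with input length rather than quadratically, which is what makes 16K-token inputs tractable.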