We present an empirical study of adapting an existing pretrained text-to-text model for long-sequence inputs. Through a comprehensive study along three axes of the pretraining pipeline -- model architecture, optimization objective, and pretraining corpus -- we propose an effective recipe for building long-context models from existing short-context models. Specifically, we replace the full attention in transformers with pooling-augmented blockwise attention, and pretrain the model with a masked-span prediction task using spans of varying length. For the pretraining corpus, we find that randomly concatenating short documents from a large open-domain corpus yields better performance than using existing long-document corpora, which are typically limited in their domain coverage. With these findings, we build a long-context model that achieves competitive performance on long-text QA tasks and establishes a new state of the art on five long-text summarization datasets, often outperforming previous methods with larger model sizes. Our code has been released at https://github.com/facebookresearch/bart_ls.
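To make the idea of pooling-augmented blockwise attention concrete, the following is a minimal, single-head sketch in PyTorch: each query block attends to its own block's tokens plus pooled summaries of every block, giving local detail with a coarse global view. The function name, mean pooling, block size, and the absence of multi-head or masking logic are illustrative assumptions, not the released facebookresearch/bart_ls implementation.

```python
# Minimal sketch of pooling-augmented blockwise attention (illustrative only).
import torch
import torch.nn.functional as F

def pooled_block_attention(q, k, v, block_size=4):
    """q, k, v: (seq_len, d); seq_len is assumed divisible by block_size."""
    seq_len, d = q.shape
    n_blocks = seq_len // block_size

    # Split the sequence into contiguous blocks: (n_blocks, block_size, d).
    qb = q.view(n_blocks, block_size, d)
    kb = k.view(n_blocks, block_size, d)
    vb = v.view(n_blocks, block_size, d)

    # One pooled key/value summary per block (mean pooling as an assumption).
    k_pool = kb.mean(dim=1)  # (n_blocks, d)
    v_pool = vb.mean(dim=1)

    # Each query block sees its local keys/values plus all pooled summaries.
    k_all = torch.cat([kb, k_pool.unsqueeze(0).expand(n_blocks, -1, -1)], dim=1)
    v_all = torch.cat([vb, v_pool.unsqueeze(0).expand(n_blocks, -1, -1)], dim=1)

    # Standard scaled dot-product attention within each augmented block.
    scores = torch.einsum("bqd,bkd->bqk", qb, k_all) / d ** 0.5
    out = torch.einsum("bqk,bkd->bqd", F.softmax(scores, dim=-1), v_all)
    return out.reshape(seq_len, d)

if __name__ == "__main__":
    x = torch.randn(16, 8)  # 16 tokens, hidden size 8
    print(pooled_block_attention(x, x, x).shape)  # torch.Size([16, 8])
```

Under this scheme the cost per block scales with (block_size + n_blocks) rather than the full sequence length, which is what allows the model to handle much longer inputs than full attention.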