Pretrained language models (PTLMs) are typically learned over a large, static corpus and further fine-tuned for various downstream tasks. However, when deployed in the real world, a PTLM-based model must deal with data distributions that deviate from what the PTLM was initially trained on. In this paper, we study a lifelong language model pretraining challenge in which a PTLM is continually updated to adapt to emerging data. Over a domain-incremental research paper stream and a chronologically ordered tweet stream, we incrementally pretrain a PTLM with different continual learning algorithms and track downstream task performance (after fine-tuning). We evaluate the PTLM's ability to adapt to new corpora while retaining knowledge learned from earlier corpora. Our experiments show that distillation-based approaches are most effective in retaining downstream performance on earlier domains. These algorithms also improve knowledge transfer, allowing models to achieve better downstream performance on the latest data, and improve temporal generalization when distribution gaps exist between training and evaluation data due to the passage of time. We believe our problem formulation, methods, and analysis will inspire future studies on the continual pretraining of language models.
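To make the setup concrete, below is a minimal sketch of distillation-regularized continual pretraining over a corpus stream, assuming a RoBERTa-style masked language model and a simple logit-distillation penalty against a frozen snapshot of the previous-stage model. The function and parameter names (`continual_pretrain`, `corpus_stream`, `distill_weight`) are illustrative and not taken from the paper; the actual algorithms, masking procedure, and hyperparameters may differ.

```python
# Illustrative sketch only: a continual pretraining loop with knowledge
# distillation from the previous-stage model, not the paper's exact method.
import copy
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)


def continual_pretrain(corpus_stream, distill_weight=1.0):
    """Adapt the model to each corpus in turn, distilling from a frozen
    snapshot taken before the current adaptation stage."""
    for corpus in corpus_stream:                      # e.g. domain- or time-ordered lists of text
        teacher = copy.deepcopy(model).eval()         # snapshot of the previous-stage model
        for p in teacher.parameters():
            p.requires_grad_(False)
        for text in corpus:
            batch = tokenizer(text, return_tensors="pt", truncation=True)
            labels = batch["input_ids"].clone()       # plain LM labels; real MLM masking omitted for brevity
            outputs = model(**batch, labels=labels)
            with torch.no_grad():
                teacher_logits = teacher(**batch).logits
            # The KL term keeps the current model's predictions close to the
            # earlier snapshot, mitigating forgetting of previous corpora.
            distill_loss = F.kl_div(
                F.log_softmax(outputs.logits, dim=-1),
                F.softmax(teacher_logits, dim=-1),
                reduction="batchmean",
            )
            loss = outputs.loss + distill_weight * distill_loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

In this sketch, downstream evaluation would be performed by fine-tuning a copy of the model after each stage, so that adaptation to the latest corpus and retention on earlier corpora can both be measured.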