When an NLP model is trained on text data from one time period and tested or deployed on data from another, the resulting temporal misalignment can degrade end-task performance. In this work, we establish a suite of eight diverse tasks across different domains (social media, science papers, news, and reviews) and periods of time (spanning five years or more) to quantify the effects of temporal misalignment. Our study focuses on the ubiquitous setting in which a pretrained model is optionally adapted through continued domain-specific pretraining and then finetuned on a downstream task. We find stronger effects of temporal misalignment on task performance than previously reported. We also find that, while temporal adaptation through continued pretraining can help, these gains are small compared to those from task-specific finetuning on data from the target time period. Our findings motivate continued research on improving the temporal robustness of NLP models.